ISYE6501 Notes

Week 1

Module 1 - Intro & What is Modeling

What is modeling?

  1. Describe a real-life situation mathematically
  2. Analyze the math
  3. Turn mathematical answer back into real-life situation

Can mean 3 things:

  1. General (regression)
  2. General with variables (regression using size, weight, distance)
  3. Specific estimates (regression using bias + coefficients * variables)

Module 2 - Classification

Error & cost:

  • Minimizing errors: Classification lines try to divide data points with the max amount of distance to account for small data variations that might lead to classification errors.
  • Should also consider the relative cost of errors, between FPs and FNs. The more costly the error for either, the more distance we want to shift the line away from it. Reflects "risk tolerance"; relevant in medical analytics.

Features:

  • If you see a straight horizontal/vertical line between 2 features, it means that the feature parallel to the line is not useful.

Data Definitions

Tables

  • Every row is a "data point"
  • Each column of a data point contains "features", "attributes", "covariates", etc., i.e. the x values.
  • Response/outcome is the "answer" of each data point, i.e. the y value.

Structured Data

  • data stored in a structured way, either quantitatively (age) or categorically (gender, hair color)

Unstructured Data

  • Data not easily described and stored.
  • Example: Text

Common types of structured data

  1. Quantitative - numbers with meaning
  2. Categorical
    • Numbers without meaning (ex. zipcodes)
    • Non-numeric: Hair color
  3. Binary data (subset of categorical data)
    • Can only take one of two values
    • Can sometimes be treated as a quantitative measure
  4. Unrelated data - no relationship between data points
  5. Time series data - same data recorded over time (stock, height of a child)
    • Often recorded at same interval (but not necessarily)

Support Vector Machines

Applied to credit loan problem:

m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
  x_i1 = credit score of person i
  x_i2 = income of person i
y_i = response for data point i
  y_i = 1 if data point i is blue, -1 if data point i is red
line = (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0
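The classification rule above can be sketched as a small function. The coefficients below are made up for illustration, not from a fitted SVM:

```python
# Sketch of classifying a point with a linear decision function
# a_0 + a_1*x_1 + ... + a_n*x_n = 0; coefficients here are invented.

def classify(a0, a, x):
    """Return 1 ("blue") if the point is on the positive side of the
    line a0 + sum(a_j * x_j) = 0, else -1 ("red")."""
    score = a0 + sum(aj * xj for aj, xj in zip(a, x))
    return 1 if score >= 0 else -1

# hypothetical coefficients: credit score and income, already scaled
a0, a = -1.0, [1.5, 0.8]
print(classify(a0, a, [0.9, 0.7]))   # -> 1
print(classify(a0, a, [0.2, 0.1]))   # -> -1
```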

Notes:

  • Need to scale the data. If not, certain coefficients will be far more sensitive to change than others.
  • If a certain coefficient (feature) is near-zero, it is probably not relevant for classification.
  • Works the same in more dimensions (ie. more attributes)
  • Classifier doesn't have to be a straight line. Can use kernel methods to make non-linear classifiers.
  • Logistic regression is better for getting probability of classification (e.g. 37% likely to default).

Scaling data

Scaling linearly:

$$ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

Scaling to a normal distribution:

$$ x_{standardized} = \frac{x - \mu}{\sigma} $$

Which method to use? Depends:

  • Use scaling for data in a bounded range.
    • neural networks
    • optimization models that need bounded data
    • batting average (0.000-1.000)
    • RGB color intensities (0-255)
    • SAT scores (200-800)
  • Use standardization for PCA / Clustering.
  • Sometimes it won't be clear. Try both!
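A minimal sketch of both approaches, using the SAT example from above:

```python
# Two scaling approaches: linear (min-max) scaling and standardization.

def scale_minmax(values):
    """Linear scaling to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: (x - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

sat_scores = [200, 500, 800]
print(scale_minmax(sat_scores))   # -> [0.0, 0.5, 1.0]
print(standardize(sat_scores))    # mean 0, standard deviation 1
```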

Handwritten notes for SVM


KNN Algorithm

Idea:

  1. Find the class of a new data point
  2. Pick the k-closest points ('nearest neighbors') to the new one
  3. The new point's class is the most common among the k-neighbors

Things to keep in mind:

  • Can be used for multi-class classification tasks.
  • "K" is a parameter that can be tweaked.
  • There is more than one way to calculate distance metrics.
  • Some attributes can be weighted by importance.
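The three steps above, as a minimal sketch (Euclidean distance, unweighted attributes; the points and labels are made up):

```python
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k):
    """Find the k nearest training points (Euclidean distance) and
    return the most common label among them."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, new_point)) ** 0.5, label)
        for p, label in zip(train_points, train_labels)
    )
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_classify(points, labels, (2, 2), k=3))  # -> red
```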


Classification Summary

  • Divides data points into groups based on similarity
  • Graphical intuition
  • Basic solution methods
    • SVM
    • KNN
  • Data terminology
  • Validation
  • Distance metrics
  • Confusion matrices

Week 2

Module 3 - Validation

Data has 2 types of patterns:

  1. real effects - real relationship between attributes and response
  2. random effect - random, but looks like a real effect

"Fitting" matches both:

  • real effects: same in all data sets
  • random effects: different in all data sets
  • BUT: only real effects are duplicated in other data.

How to measure a model's performance:

  • larger set of data to fit the model
  • smaller set to measure the model's effectiveness

Train/Validate/Test to choose best model

What if we want to compare 5 runs of SVM and KNN?


Problems:

  • Observed performance = real quality + random effects. High-performing models are more likely to have above-average random effects.
  • Just choosing the best-performing model is too optimistic.

Solution:

  • Use both validation and test data set.
  • Training: build models; Validation: pick the best model; Test: estimate its performance


Splitting Data - Random/Rotation

How much data to use?

  • Typically 70-90% training, 10-30% test.
  • BUT to compare models: 50-70% train, split the rest equally between validation and test.

How to split data?

  1. Random: Randomly sample split (60:20:20) without replacement.
  2. Rotation: Take turns selecting points (e.g. 5 data point rotation sequence). Benefit: Equally separates data. Cons: May introduce bias, if rotations match a pattern (e.g. workdays)

Splitting Data - k-Fold Cross Validation

Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.

  • Pros: Better use of the data (trains across all available data), better estimate of model quality and more effective way to choose a model.
  • For 4-fold, use 3 folds (ABC) for training and 1 (D) for validation, then rotate so each fold takes a turn as the validation set. Each fold is used k-1 times for training and once for validation.
  • Finally, average the k evaluations to estimate the model's quality.
  • Most common k is 10.
  • Important: Don't use the resulting models from k-fold cross validation, either by choosing one of the trained models or by averaging the coefficients across the k splits. Train the model again using all the data.
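A sketch of the fold bookkeeping (indices only; the model-fitting step is left out, and the rotation-style split is one of the two splitting methods above):

```python
# k-fold cross validation: each fold is held out once while the model
# trains on the rest. This only generates the index splits.

def kfold_indices(n, k):
    """Yield (train_idx, validate_idx) pairs for k folds of n points."""
    folds = [list(range(i, n, k)) for i in range(k)]  # rotation split
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, folds[held_out]

scores = []
for train_idx, val_idx in kfold_indices(n=20, k=4):
    # fit the model on train_idx, evaluate on val_idx (omitted here)
    scores.append(len(val_idx))  # stand-in for a real quality score
# average the k evaluations to estimate model quality
print(sum(scores) / len(scores))  # -> 5.0
```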


Module 4 - Clustering

Definition: Grouping data points

Use:

  • Targeted marketing / market segmentation (market on size / price / versatility / aesthetics)
  • Personalized medicine
  • Locating facilities
  • Image analysis (face recognition, captcha)
  • Data investigation

Distance Norms

Euclidean (straight-line) distance
  • 2-dimension
$$ Distance = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2} $$
Rectilinear distance
  • 2-dimension
  • Used for calculating distance in a grid.
  • Also called "Manhattan distance"
$$ Distance = |x_1 - y_1| + |x_2 - y_2| $$
p-norm Distance (Minkowski Distance)
  • n-dimension
  • Sum over all n dimensions
$$ Distance = \sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p} $$
Infinity Norms
  • Definition: "The infinity norm simply measures how large the vector is by the magnitude of its largest entry." (reference)
  • It is essentially p-norm distance where p is set to infinity.
$$ Distance = \sqrt[\infty]{\sum_{i=1}^{n}|x_i - y_i|^{\infty}} = \max_i|x_i - y_i| $$

Why use infinity norms?

  • Robotics example: Automated storage and retrieval system. Time it takes to store/retrieve depends on moving to the aisle (horizontal) and arm height (vertical).
  • A "one-norm" would be the robot moving to the aisle, then lifting its arm to the correct height.
  • An "infinity-norm" would be the robot moving + stretching its arms and the max time is dictated by whichever takes longer to complete.
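The four norms above, as plain functions (x and y are coordinate tuples):

```python
# Distance norms from the notes, written as simple functions.

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def rectilinear(x, y):  # Manhattan / 1-norm
    return sum(abs(a - b) for a, b in zip(x, y))

def p_norm(x, y, p):    # Minkowski
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def infinity_norm(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (3, 4)
print(euclidean(x, y))      # -> 5.0
print(rectilinear(x, y))    # -> 7
print(infinity_norm(x, y))  # -> 4  (whichever movement takes longer)
```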

K-Means Algorithm

How it works:

  1. Pick k cluster centers within range of data
  2. Assign each data point to nearest cluster center
  3. These initial centers are not the actual "center" of each cluster. Recalculate each cluster's centroid; these centroids become the new cluster centers.
  4. With the new centroids, some data points might now be closer to a different cluster's centroid. Keep repeating steps 2 and 3 until no data point changes clusters.
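The four steps above, as a minimal sketch; initial centers are passed in directly here rather than picked at random:

```python
# Minimal k-means sketch; points and centers are coordinate tuples.

def kmeans(points, centers, max_iter=100):
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # step 3: recompute each center as its cluster's centroid
        new_centers = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        # step 4: stop when no assignment changes (centers stabilize)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centers, clusters = kmeans(pts, centers=[(0, 0), (10, 10)])
print(centers)  # -> [(1.25, 1.5), (8.5, 8.5)]
```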

Characteristics:

  • k-means is a heuristic: Fast, good but not guaranteed to find the absolute best solution.
  • Expectation-maximization: k-means is an EM-style algorithm. The "expectation" step assigns each point to its nearest cluster center; the "maximization" step moves each center to maximize the negative total distance (i.e. minimize the distance) to its cluster's points.
K-means in practice
  1. Outliers: k-means assigns the outlier to whichever cluster is closest. The easiest fix is to remove those outliers, but best practice is to understand why they are there.
  2. It's fast: Take advantage of its runtime by running several iterations.
    • For each run, choose a different initial cluster center (randomized initialization), then choose the best solution found.
    • Test different values of k as well.
  3. Understanding k for optimization.
    • Make sure to "fit the situation you're analyzing". k == # of data points may be the most theoretically optimal, but does that actually make sense for the task?
    • Solution: Plot k vs total distance to find the "elbow" of the curve. At a certain number, the benefit of adding another k becomes negligible.
K-means for predictive clustering

Classification task: Given a new data point, determine which cluster set the new data point belongs to. To do this, simply put it into whichever cluster centroid it is closest to.

Another classification task: What range of possible data points would we assign to each cluster?

The cluster regions form a "Voronoi diagram" (image not included).

Classification / Clustering - Differences

  • Difference is what we know about the data points
  • In classification, we know each data point's correct class. This is called supervised learning.
  • In clustering, we don't know the correct classes. This is called unsupervised learning.


Week 3

Module 5 - Basic Data Preparation

Data Preparation: Quantitative Examples

  • credit scores
  • average daily temperature
  • stock price

Things to watch out for before building models:

  • Scale of the data: Can throw off models. Beware of possible outliers.
  • Extraneous information: Complicates the model, harder to correctly interpret the solution.

Outlier Detection

Definition: Data point that's very different from the rest.

Types of outliers
  • point outliers: values far from the rest of the data.
    • Example: Outlier point in a cluster scatterplot
  • contextual outlier: value isn't far from the rest overall, but is far from points nearby in time.
    • Example: Time series data, low temperature at an unexpected time.
  • collective outlier: Something is missing in a range of points, but can't tell exactly where.
    • Example: Missing heartbeat in a time series chart of ECG data.
How to detect outliers
1. Box-and-whisker plot
  • helps find outliers (for 1D data)
  • box is 25/75th percentile of values
  • middle of the box is the median
  • "whiskers" expanding outside of box is the reasonable range of values expected of the data (e.g. 10/90th or 5/95th percentile)
  • Data points outside of the whisker could be considered outlier data.
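A sketch of box-plot-style outlier flagging. It uses the common 1.5 x IQR whisker convention, which is an assumption here; the notes describe whiskers more loosely (e.g. 5th/95th percentile):

```python
# Flag 1-D point outliers using the 1.5 * IQR box-plot convention
# (an assumed rule, not stated in the notes).

def quartiles(values):
    """25th and 75th percentiles via linear interpolation."""
    s = sorted(values)
    n = len(s)
    def pct(p):
        i = p * (n - 1)
        lo, hi = int(i), min(int(i) + 1, n - 1)
        return s[lo] + (s[hi] - s[lo]) * (i - lo)
    return pct(0.25), pct(0.75)

def iqr_outliers(values):
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 11, 12, 12, 13, 13, 14, 15, 40]
print(iqr_outliers(data))  # -> [40]
```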


2. Modeling error
  • Idea: Fit model, then find points of high error
  • Example: Fit exponential smoothing model to temperature/time series data.
  • Points with very large errors might be outliers. The model expects a smooth curve; the actual data is far from it.


Dealing with Outliers

Depends on the data:

  • Bad data (failed sensor, bad experiment, bad input)... or maybe real data?
  • Need to find the root cause - possible avenues: find the data source, understand how it was compiled.
  • With large data sets, outliers are expected.
    • In normally-distributed data, about 5% of points fall outside 2 standard deviations.
    • With 1,000,000 data points, >2000 are expected outside 3 standard deviations.
    • Removing them can make the model look too optimistic; real noise needs to be accounted for.

Actions:

  • If really bad data, remove the data and use data imputation to fill the gap
  • If real/correct data, think whether the data you're modeling could realistically have outlier data points (e.g. force majeure incidents causing delays/cancellations).
  • Modeling outliers: Using logistic regression model, estimate probability of outliers happening under different conditions. Then, use a second model to estimate length of delivery under normal conditions, using data without outliers.

Module 6 - Change Detection

Definition: Determining whether something has changed.

Why:

  • Determine if action is needed (preventative maintenance)
  • Determine impact of past action (did discount increase sales?)
  • Determine changes to help planning (did voting patterns change?)
  • Hypothesis tests often not sufficient for change detection, because they are slow to detect change.

Cumulative sum (CUSUM) approach

Definition: Detect an increase or decrease (or both) via a cumulative sum: "has the cumulative sum of deviations from the mean gone beyond a critical level?"

How:

  • The basic idea is to keep a running total of how much each observation exceeds the expected mean, across time steps $t$.
  • If the running total drops below zero, reset it to zero. This allows change to be detected quickly.
  • The running total ($S_t$) is then compared against a threshold $T$ to decide whether a change has occurred.
  • $C$ is a buffer that pulls the running total down to prevent too many false positives from occurring.

Terms:

  • $S_t$ = running total of deviations from the expected mean
  • $X_t$ = observed value at time $t$
  • $\mu$ = mean of x, if no change
  • $C$ = buffer value
  • $T$ = threshold value
  • $X_t - \mu$ = how much above expected the observation is at time $t$
  • $\mu - X_t$ = how much below expected the observation is at time $t$
Formula and Example:

img

Formula to detect increase (True if $S_t \ge T$)

$$ S_t = \max\left\{0,\; S_{t-1} + (X_t-\mu-C) \right\} $$

Formula to detect decrease (True if $S_t \ge T$); $\mu$ and $X_t$ are flipped.

$$ S_t = \max\left\{0,\; S_{t-1} + (\mu-X_t-C) \right\} $$

Note: Both can be used in conjunction to create a control chart, where $S_t$ is plotted over time and if it ever gets beyond this threshold line, it shows that CUSUM detected a change.
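The increase-detection formula can be sketched directly; the observations, mu, C, and T below are made up:

```python
# CUSUM increase detection: S_t = max(0, S_{t-1} + (x_t - mu - C)),
# flagging a change once S_t >= T. All parameter values are invented.

def cusum_increase(xs, mu, C, T):
    """Return the first time index where S_t >= T, or None."""
    S = 0.0
    for t, x in enumerate(xs):
        S = max(0.0, S + (x - mu - C))
        if S >= T:
            return t
    return None

xs = [10, 9, 11, 10, 14, 15, 16, 15]   # mean shifts upward at index 4
print(cusum_increase(xs, mu=10, C=1, T=5))  # -> 5
```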

Week 4

Module 7 - Time Series Models

Exponential Smoothing

Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.

Example:

  • $S_t$ - expected baseline response at time period $t$
  • $X_t$ - observed response at $t$
  • What is real blood pressure over time without variation?
  • Different blood pressure, increase in baseline, or random event?
  • Ways to answer:
    • $S_t = X_t$ - observed blood pressure is real indicator
    • $S_t = S_{t-1}$ - today's baseline is the same as yesterday's baseline

Exponential Smoothing Model

  • Sometimes called single, double, triple depending on how many aspects are considered (seasonality, trend)
  • Triple exponential smoothing is also called Winters' Method or Holt-Winters.
Formula:
$$ S_t = \alpha x_t + (1 - \alpha)S_{t-1}, \quad 0 < \alpha < 1 $$
  • $\alpha$ trades off between trusting $x_t$ when $\alpha$ is large, and trusting $S_{t-1}$ when $\alpha$ is small.
  • $\alpha$ -> 0: A lot of randomness in the system.
    • Trust previous estimate $S_{t-1}$
  • $\alpha$ -> 1: Not much randomness in the system.
    • Trust what you see $x_t$.
  • How to start? Set the initial condition: $S_1 = x_1$
  • Does not deal with trends or cyclical variations - more on that later.
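A sketch of the basic formula with the initial condition S_1 = x_1; the readings and alpha value are made up:

```python
# Basic (single) exponential smoothing:
# S_t = alpha * x_t + (1 - alpha) * S_{t-1}, with S_1 = x_1.

def exp_smooth(xs, alpha):
    S = [xs[0]]  # initial condition: S_1 = x_1
    for x in xs[1:]:
        S.append(alpha * x + (1 - alpha) * S[-1])
    return S

readings = [120, 122, 140, 121, 119]   # one spike at index 2
print(exp_smooth(readings, alpha=0.2))
# the spike (140) is smoothed: the baseline only rises to ~124, not 140
```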

Time series complexities

  • Trends (increase/decrease)
  • Cyclical patterns (temp/sales/biometric cycles)

Trends

Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$

Trend at time $t$ is based on the delta between baseline estimates $S_t$ and $S_{t-1}$, with a constant $\beta$. $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$

  • Initial condition is $T_1 = 0$
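The baseline and trend updates together (double exponential smoothing), as a sketch; the alpha/beta values in the example are made up:

```python
# Double exponential smoothing (baseline + trend), with the initial
# conditions S_1 = x_1 and T_1 = 0. Alpha and beta here are invented.

def double_exp_smooth(xs, alpha, beta):
    S, T = [xs[0]], [0.0]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * (S[-1] + T[-1])   # baseline update
        t = beta * (s - S[-1]) + (1 - beta) * T[-1]     # trend update
        S.append(s)
        T.append(t)
    return S, T

xs = [10, 12, 14, 16, 18]          # steady upward trend of +2 per step
S, T = double_exp_smooth(xs, alpha=0.5, beta=0.5)
print(round(T[-1], 2))             # -> 2.01 (trend estimate nears +2)
```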

Cyclic

Two ways to calculate:

  1. Like trend - additive component of formula, or
  2. Seasonalities: multiplicative way where:
    • $L$: Length of a cycle
      • e.g. $L=24$ for hourly biometric readings in a day
    • $C_t$: Multiplicative seasonality factor for time $t$.
      • inflates/deflates the observation

Baseline formula (including trend + seasonality)

$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$

Update the seasonal (cyclic) factor in a similar way as trends:

$$C_t = \gamma(\frac{x_t}{S_t}) + (1 - \gamma)C_{t-L}$$
  • No initial cyclic effect: $C_1$, ..., $C_L$ = 1

Example: Sales trends

  • If $C$ is 1.1 on Sunday, then sales were 10% higher just because it was Sunday
  • Of 550 sold on Sunday, 500 is baseline, 50 is 10% extra

Starting Conditions

For trend:

  • $T_1 = 0$: No initial trend

For multiplicative seasonality

  • Multiplying by 1: shows no initial cyclic effect
  • First $L$ values of $C$ set to 1.

Exponential Smoothing: What the Name Means

"Smoothing"


  • The equation smooths high and low spikes using $\alpha$.
    • If we have a high observed value $x_t$, the baseline estimate $S_t$ is not as high: the high value is weighted only by $\alpha < 1$, and pulled down by the previous baseline $S_{t-1}$.
    • If we have a low observed value, the $(1-\alpha)S_{t-1}$ term pulls the estimate up from the low observed value.

Graph of the smoothed vs. observed values (image not included).

"Exponential"

Each time period's estimate can be expanded recursively:

$$ S_t = \alpha x_t + (1-\alpha)\alpha x_{t-1} + (1-\alpha)^2\alpha x_{t-2} + \ldots + (1-\alpha)^{t-1} x_1 $$

  • Each time period is weighted by $1-\alpha$ to an increasing exponent.
  • Significance: Every past observation contributes to the current baseline estimate $S_t$.
  • More recent observations are weighted more, i.e. more important.

Applications in Forecasting

Given basic exponential smoothing equation

$$S_t = \alpha x_t + (1-\alpha)S_{t-1}$$

We want to make a prediction for time period $t+1$. Since $x_{t+1}$ is unknown, replace it with the best estimate available, $S_t$.

Using $S_t$, the forecast for time period $t+1$ is

$$F_{t+1} = \alpha S_t + (1-\alpha)S_t$$

hence, our estimate is the same as our latest baseline estimate

$$F_{t+1} = S_t$$

Factoring in trend/cycle

The above equation can be extrapolated to trend/cycle calculations.

Best estimate of trend is the most current trend estimate:

$$F_{t+1} = S_t + T_t$$

Same for cycle (multiplicative seasonality)

$$F_{t+1} = (S_t + T_t) C_{(t+1)-L}$$

Where $F_{t+k} = (S_t + kT_t)C_{(t+k)-L}$, for $k = 1, 2, \ldots$

AutoRegressive Integrated Moving Average (ARIMA)

3 key parts

1. Differences

  • Exponential smoothing assumes the data is stationary, but real data often is not.
  • However, the difference might be stationary.
  • For example:

    • First-order difference $D_{(1)}$: difference of consecutive observations

    $$D_{(1)t} = x_t - x_{t-1}$$

    • Second-order difference $D_{(2)}$: differences of the differences

    $$D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$$

    • Third-order difference $D_{(3)}$: differences of the differences of the differences

    $$D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})]$$

    • In general, the d-th order difference applies differencing d times.
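The differencing idea, as a sketch: applying first-order differencing d times gives the d-th order difference.

```python
# d-th order differencing by repeatedly taking first differences.

def difference(xs, d):
    """Apply first-order differencing d times."""
    for _ in range(d):
        xs = [b - a for a, b in zip(xs, xs[1:])]
    return xs

xs = [1, 4, 9, 16, 25]       # quadratic, so not stationary
print(difference(xs, 1))      # -> [3, 5, 7, 9]
print(difference(xs, 2))      # -> [2, 2, 2]  (second differences are constant)
```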

2. Autoregression

Definition: Predicting the current value based on previous time periods' values.

Exponential smoothing, viewed as autoregression:

  • an order-$\infty$ autoregressive model.
  • uses data all the way back.

Order-p autoregressive model:

  • Go back only p time periods.

"ARIMA" combines autoregression and differencing

  • autoregression on the differences
  • use p time periods of previous observations to predict d-th order differences.

3. Moving Average

  • Uses previous errors $\epsilon_t$ as predictors $$\epsilon_t = \hat{x}_t - x_t$$
  • Apply order-q moving average (go back q time periods) $$\epsilon_{t-1}, \epsilon_{t-2}, \ldots , \epsilon_{t-q}$$

ARIMA (p,d,q) model

$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)(t-i)} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$

Choose:

  • p-th order autoregression
  • d-th order difference
  • q-th order moving average
  • Once p, d, q are chosen, statistical software can find the model's coefficients ($\mu$, $\alpha_i$, $\theta_i$) through optimization.

Other flavors of ARIMA

  • Add seasonal values of p, d, q
  • Set specific values of p, d, q to get certain qualities
    • ARIMA(0,0,0) = white noise
    • ARIMA(0,1,0) = random walk
  • Can generalize a lot of simpler models:
    • ARIMA(p,0,0) = AR (autoregressive) model
    • ARIMA(0,0,q) = MA (moving average) model
    • ARIMA(0,1,1) = basic exponential smoothing model
  • Short-term forecasting
    • Usually better than exponential smoothing if the data is more stable, with fewer peaks/valleys/outliers.
  • Need ~40 past data points for ARIMA to work well.

Generalized Autoregressive Conditional Heteroskedasticity (GARCH)

Definition: Estimate or forecast the variance of something, given time-series data.

Motivation:

  • Estimate the amount of error.
  • Example: Forecast demand for pickup trucks
    • Variance tells how much forecast might be higher/lower than true value
  • Example: Traditional portfolio optimization model.
    • Balances expected return of a portfolio with the amount of volatility. (Riskier: higher expected return, and vice-versa)
    • Variance is a proxy for the amount of volatility/risk.

Mathematical Model:

$$ \sigma_t^2 = \omega + \sum_{i=1}^{p}\beta_i\sigma_{t-i}^2 + \sum_{i=1}^{q} \gamma_i\epsilon_{t-i}^2 $$
  • Equation is similar to ARIMA, with 2 differences
    1. Uses variance/squared errors, not observations/linear errors.
    2. Uses raw variances, not differences of variances.
  • Stat software can fit the model, given p and q (don't need d since GARCH doesn't use differences)

SUMMARY

  1. Exponential smoothing
  2. ARIMA - generalized version of exponential smoothing
  3. GARCH - ARIMA-like model for analyzing variance

Week 5

Basic Regression

What it explains:

  • Causation / Correlation
    • Value of a home run
    • Effect of economic factors on presidential election
  • Prediction
    • Height of child at adulthood
    • Future oil price
    • Housing demand in next 6 months

Simple Linear Regression (SLR)

Definition: Linear regression with one predictor

  • Looks for linear relationship between predictor and response.


Sum of Squared Error

Example:

  • $y_i$ = cars sold for data point i
  • $\hat{y}_i$ = model's prediction of cars sold
  • Data point i's prediction error: $y_i - \hat{y}_i = y_i - (a_0 + a_1 x_{i1})$

Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n\big(y_i - (a_0 + a_1 x_{i1})\big)^2 $$

Best-fit regression line

  • The line moves around the space to minimize the sum of squared errors. As it moves, the values of the coefficients $a_0$ and $a_1$ change.
  • Defined by $a_0$ and $a_1$
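For one predictor, minimizing the sum of squared errors has a standard closed-form solution, sketched here with made-up data:

```python
# Closed-form least squares for simple linear regression:
#   a_1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   a_0 = ybar - a_1 * xbar

def fit_slr(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    a0 = ybar - a1 * xbar
    return a0, a1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 1 + 2x
a0, a1 = fit_slr(xs, ys)
print(a0, a1)              # -> 1.0 2.0
```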

Maximum likelihood (or how to measure the quality of a model's fit)

  • Likelihood: Measure the probability (density) for any parameter set.
  • Maximum likelihood: Parameters that give the highest probability.

How it works:

  • Assume that the observed data is the correct value and we have information about the variance.
  • Then for any set of params we can measure the probability (density) that the model would generate the estimates it does.
  • Whichever set of params gives the highest probability density (i.e. maximum likelihood) is the best-fit set of params.

Maximum Likelihood - Example

  • Assumption: the distribution of errors is normal with mean zero and variance $\sigma^2$, independently and identically distributed.
  • Observations: $z_1$, ..., $z_n$
  • Model estimates: $y_1$, ..., $y_n$
  • Maximum-likelihood estimate: The set of parameters that minimizes the sum of squared errors


MLE in Linear Regression

$$ \min_{a_0,\ldots,a_m} \sum_{i=1}^{n}\Big(y_i - \big(a_0 + \sum_{j=1}^{m} a_j x_{ij}\big)\Big)^2 $$

  • $x_{ij}$ is the _j_th observed predictor value of data point $i$.
  • $a_0$ through $a_m$ are the params we're trying to fit.

Maximum Likelihood Fitting

  • Simple example: regression with independent normally distributed errors
  • But can get complex with:
    • different estimation formulas
    • different assumptions about the error
  • Stat software can handle most of these complexities.

Models that Combine Maximum Likelihood with Model Complexity

Akaike Information Criterion (AIC)

  • $L^*$: Maximum likelihood value
  • $k$: number of parameters being estimated
$$ AIC = 2k - 2\ln{(L^*)} $$
  • $2k$ is the penalty term, balancing the model's likelihood against its simplicity. This reduces the chance of overfitting by adding too many parameters.
  • The model with the smallest AIC is preferred.

AIC applied to regression

  • $m+1$: number of parameters
  • Substitute the linear regression likelihood for $L^*$ (maximum likelihood value).


Corrected AIC ($AIC_c$)

  • AIC weakness: It is only guaranteed to work well as the number of data points grows toward infinity.
  • Corrected AIC accounts for finite samples by adding a correction term.

Equation:

  • $n$: number of data points
$$ AIC_c = AIC + \frac{2k(k+1)}{n-k-1} = 2k - 2\ln{(L^*)} + \frac{2k(k+1)}{n-k-1} $$

Comparing AIC / AICc models by Relative Probability

Example:

  • Model 1: AIC = 75
  • Model 2: AIC = 80

Relative likelihood $$ e^\frac{(AIC_1 - AIC_2)}{2} $$

Applied to Models 1 & 2: $$ e^\frac{(75 - 80)}{2} = e^{-2.5} \approx 8.2\% $$

Result:

  • Model 2 is 8.2% as likely as Model 1 to be the better model.
  • Considering lower AIC is better, Model 1 will be the better choice
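The arithmetic above, checked directly:

```python
import math

# Relative likelihood from the AIC example above (AIC_1 = 75, AIC_2 = 80).
def relative_likelihood(aic1, aic2):
    """exp((AIC_1 - AIC_2) / 2): how likely model 2 is to be better."""
    return math.exp((aic1 - aic2) / 2)

print(round(relative_likelihood(75, 80), 3))  # -> 0.082, i.e. ~8.2%
```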

Bayesian Information Criterion (BIC)

  • $L^*$: Maximum likelihood value
  • $k$: Number of parameters being estimated
  • $n$: Number of data points
$$ BIC = k \ln{(n)} - 2 \ln{(L^*)} $$

Characteristics:

  • Similar to AIC
  • BIC's penalty term is larger than AIC's.
  • Encourages models with fewer parameters than AIC does.
  • Only use BIC when there are more data points than parameters
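A sketch comparing the two penalty terms for the same maximum log-likelihood; the k, n, and log-likelihood values are made up:

```python
import math

# AIC = 2k - 2 ln(L*);  BIC = k ln(n) - 2 ln(L*).
def aic(k, log_L):
    return 2 * k - 2 * log_L

def bic(k, n, log_L):
    return k * math.log(n) - 2 * log_L

k, n, log_L = 5, 100, -30.0        # invented values for illustration
print(aic(k, log_L))               # -> 70.0
print(round(bic(k, n, log_L), 2))  # -> 83.03
# BIC's penalty k*ln(n) = 5*ln(100) ~ 23.0 exceeds AIC's 2k = 10,
# so BIC punishes extra parameters harder once n >= 8 (ln(n) > 2).
```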

BIC Metrics - Rule of thumb

A common rule of thumb for the BIC difference between two models (Kass & Raftery):

  • 0-2: weak evidence
  • 2-6: positive evidence
  • 6-10: strong evidence
  • more than 10: very strong evidence for the lower-BIC model

Summary of AIC / BIC

  • AIC = Frequentist, BIC = Bayesian
  • No hard-and-fast rule for using AIC or BIC or maximum likelihood.
  • Looking at all three can help you decide which is best.

Understanding Regression Coefficients

Baseball Example: Determine average number of runs a home run is worth.

  • Response: How many runs a team scored that season
  • Predictors: # of home runs, triples, doubles, singles, outs, etc.

Equation:

$$ \text{Runs Scored} = a_0 + a_1[\text{Number of HR}] + a_2[\text{Number of Triples}] + \ldots + a_m[\text{Other Predictors}] $$

Applications of LR:

  1. Descriptive Analytics
  2. Predictive Analytics
  3. ~Prescriptive Analytics~ (Not used for this)

Causation vs Correlation

Causation: One thing causes another thing.
Correlation: Two things tend to happen (or not happen) together; neither might cause the other.

Application in Linear Regression:

  • x: city's avg daily winter temp
  • y: hours/days spent outdoors in winter
  • x may have causation over y (warmer avg temp -> more time outdoors in winter)
  • y does not have causation over x (more time outdoors in winter does not cause warmer avg temp)

How to decide if there is causation?

  • Cause is before effect.
  • Idea of causation makes sense (subjective)
  • No outside factors causing the relationship (hard to guarantee)

Transforming Data for Generalized LR Modeling


  • Problem: A straight regression line won't fit curved (non-linear) data.
  • Solution: Transform the data

Method 1: We can adjust the data so the fit is linear.

  • Quadratic Regression
$$ y = a_0 + a_1x_1 + a_2x_1^2 $$

Method 2: We can transform the response

  • Transform response with logarithmic function $$ \log(y) = a_0 + a_1x_1 + a_2x_2 + \ldots + a_mx_m $$

  • Box-Cox transformations (link) can be automated in statistical software.

When to Transform Data

Example: Variable interaction

  • Ex: Estimating 2-yr old's adult height.
  • Predictors:
    • Father's height
    • Mother's height
    • Product of the parents' heights ($x_1x_2$)
  • We can add the third predictor as a new input $x_3$
$$ y = a_0 + a_1x_1 + a_2x_2 + a_3(x_1x_2) $$

How to Interpret Output

1: P-Values

  • Estimates the probability that the coefficient might really be zero (a type of hypothesis testing).
  • Rule of thumb: If a coefficient's p-value is > 0.05, remove the attribute from the model.
  • Other thresholds besides 0.05 can be used:
    • Higher: More factors included
      • Con: May include irrelevant factors
    • Lower: Fewer factors included
      • Con: May leave out relevant factors.

Warnings About P-Value

  • With a large # of data points, p-values get small even when attributes are not related to the response.
  • P-values are only probabilities, even when the result seems meaningful.
    • Example: 100 attrs - 0.02 p-value each
    • Each has 2% chance of not being significant.
    • Expect that 2 of them are really irrelevant.
  • Some things to think about (from office hours):
    • Does the effect size seem sensible to support such a small p-value, or are the p-values small only due to sample size?
    • What happens to the p-value of a given attribute if others are removed from the regression model?
    • What does the known relationship between response and covariate tell you about this p-value? Is it likely a spurious but strong correlation, or a genuine effect?

2: Confidence Interval (CI)

  • Where the coefficient probably lies
  • How close that is to zero.

3: T-Statistic

  • The coefficient divided by its standard error.
  • Related to p-value.

4: Coefficient

  • Check whether the coefficient makes any practical difference, even with a low p-value.
  • Example: Estimating household income with age as an attribute.
    • If coefficient of age is 1, even with low p-value, the attribute isn't important.

(A bit more explanation on contextualizing the coefficient of 1.0 in this context explained in Piazza):

Notice that in that lecture (M5L6, timestamp ~3:10) he's talking about the value of the coefficient relative to the scale of the attribute in question and the response value. A coefficient of 1.0 on the age attribute isn't that significant because the coefficient is in units of USD/year of age, and the response is a household income. We can make a decent guess what values the age variable will take (probably adult, probably younger than retirement age), and those values multiplied by 1.0 (on the order of tens of dollars) aren't going to make much difference in a US household income (on the order of tens of thousands of dollars). That phrase "the coefficient multiplied by the attribute value" is important.

So this isn't a rule of thumb about the value 1.0. This is advice to keep the scale of your attributes and response in mind when interpreting coefficients.

5: R-Squared Value

  • Estimate of how much variability your model accounts for.
  • Example:
    • R-square = 59%
    • This means the model accounts for ~59% of variability in the data.
    • Remaining 41% is randomness or other factors.

Adjusted R-squared: Adjusts for the number of attributes used.

Warnings about R-squared:

  • Some things are not easily modeled, especially when trying to model real-life systems.
  • In these cases, even an R-squared of 0.3-0.4 is quite good.

From Office Hours: W5 HW Review

Key Assumptions of Linear Regression

  1. Linearity - There needs to exist a linear relationship between X and Y
  2. Independence - (no autocorrelation) each point is independent of the others. Time series data violates this.
  3. Normality - Residuals (difference between predicted and actual values) need to be normally distributed. Don't want to see a pattern in residuals.
  4. Constant variance (homoscedasticity) - residuals need to have constant variance.
  • P-value (in Linear Regression) is the probability that the coefficient's value is actually zero. A higher p-value means a greater probability.
  • Multiple R-squared: Percentage of variance explained by your model (0 - 1)
  • Adjusted R-squared: Accounts for the number of predictors used. More predictors lowers adjusted R-squared, reducing the bias from using too many predictors.
  • A big difference between multiple and adjusted R-squared means there are probably too many predictors.

How to calculate R-squared values directly from cross validation

# R-squared = 1 - SSresiduals/SStotal

# total sum of squared differences between data and its mean
SStat <- sum((dat$Crime - mean(dat$Crime))^2)

# for model, model2, and cross-validation, calculate SSres
SSres_model <- sum(model$residuals^2)   # model 1
SSres_model2 <- sum(model2$residuals^2) # model 2
SSres_c <- attr(c, "ms")*nrow(dat) # MSE times number of data points gives sum of squared errors

# Calculate R-squared
1 - SSres_model/SStat # initial model with insignificant predictors
# 0.803

1 - SSres_model2/SStat # model2 without insignificant predictors (based on p-value)
# 0.766

1 - SSres_c/SStat # cross-validated
# 0.638

# This shows that including the insignificant factors overfits
# compared to removing them, and even the fitted model is probably overfitted.
# This is not surprising since we only have 47 data points and 15 predictors (3:1 ratio)
# Good to have 10:1 or more.

HW 6 Preview - PCA

Q0.1 - Using the crime dataset, apply PCA and then create a regression model using the first few principal components.

  • Specify the new model in terms of the original variables (not principal components).
  • Compare the quality of the PCA model to that of your solution to Q8.2 (linear regression with no PCA).
  • You can use the R function prcomp for PCA (scale the data first with scale=TRUE).
  • Don't forget to unscale the coefficients to make a prediction for the new city (ie. do the scaling calculation in reverse).

  • Eigenvalues correspond to the importance (variance) of each principal component.

  • Eigenvectors correspond to the directions of the new axes.
    • Can find them through eigendecomposition, or SVD of the data covariance matrix.
    • Data is then projected onto these axes (matrix multiplication) to get the new features.
    • prcomp does all of this under the hood.
    • Still need to scale.
  • PCA will return n principal components: usually only want to use the first several (often 2). Using all of them doesn't make sense because you're not taking advantage of the reduced dimensionality.

How to decide which PC is important

  • prcomp will rank them for us.
  • 1st will have highest eigenvalue, and so forth.
  • You can plot this by cumulative explained variance vs individual explained variance (see plot)

img
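The explained-variance ranking can be sketched in Python (numpy, hypothetical two-feature data) by taking the SVD of the scaled data; the squared singular values are the eigenvalues:

```python
import numpy as np

# Hypothetical, highly correlated two-feature data (stand-in for a real dataset)
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.9 * x1 + 0.1 * rng.normal(size=200)])

# Center and scale each column (what prcomp(..., scale=TRUE) does in R)
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the scaled data: squared singular values are the eigenvalues of Xs^T Xs
_, s, _ = np.linalg.svd(Xs, full_matrices=False)
eigvals = s**2

explained = eigvals / eigvals.sum()   # individual explained variance
cumulative = np.cumsum(explained)     # cumulative explained variance

print(explained)   # PC1 dominates because the two features are nearly redundant
```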

Week 6 - Advanced Data Preparation

Box-Cox Transformations

Why: Normality assumption

  • Some models assume data is normally distributed. Bias occurs when this assumption is wrong.
  • Left chart is normal, right chart shows heteroscedasticity (unequal variance)
    • (If applying regression model to right chart) Higher variance at upper end can make those estimation errors larger, and make model fit those points better than others.
  • We can account for this via several methods.

img

Box-Cox Transformation

  • What: Transforms a response to eliminate heteroscedasticity.
  • How: Applies a power (log-like) transformation that:
    1. stretches smaller range to increase its variability.
    2. shrinks larger range to reduce its variability.

Box-Cox Formula:

  • $y$: Vector of responses
  • $t(y)$: Transformed vector
  • $\lambda$: Param to adjust
  • Goal: Transform $t(y)$ to normal distribution
$$t(y) = \frac{y^\lambda - 1}{\lambda}$$

NOTE: First check whether you need the transformation (e.g. QQ plot)
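A minimal Python sketch of the formula (the search for a good λ is what tools like R's MASS::boxcox or scipy.stats.boxcox automate; here λ is just passed in):

```python
import numpy as np

def box_cox(y, lam):
    """t(y) = (y^lambda - 1) / lambda, with log(y) as the lambda -> 0 limit."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

y = np.array([0.5, 1.0, 2.0, 10.0, 100.0])
print(box_cox(y, 0.5))   # compresses the large values relative to the small ones
```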

Detrending

Why: Time series data will often have a trend (e.g. inflation, seasons) which may bias the model. Need to adjust for these trends to run a correct analysis.

  • Example: Price of gold over time, raw price versus inflation-adjusted price.
  • If fitting regression model based on other factors with data as-is, it will return a very different response using same factor values depending on the decade.
  • Adjusting for inflation will return a closer value regardless of time.

When: Whenever using factor-based model (regression, SVM) to analyze time-series data*

("Factor-based model" uses a bunch of factors to make a prediction, non-factor based model example would be a model using predictors based on time and previous values)

On What:

  • Response
  • Predictors (time series data)

How:

  • Factor-by-factor
    • 1D regression: $y = a_0 + a_1x$
    • Ex. Simple linear regression of gold prices ($Price = -45,600 + 23.2 * Year$)
    • Detrended: $Actual price - (-45,600 + 23.2 * Year)$
  • Example charts: Raw (left), Inflation-adjusted (center), Detrended via linear regression (right)
    • Center and right a bit different bc linear regression accounts for different inflation rates each year.

img
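The factor-by-factor approach can be sketched in Python with a hypothetical trend-plus-noise series: fit the 1-D regression, then subtract the fitted trend:

```python
import numpy as np

# Hypothetical gold-price-style series: linear trend plus noise
years = np.arange(1980, 2020)
rng = np.random.default_rng(0)
price = -45600 + 23.2 * years + rng.normal(scale=50, size=years.size)

# 1-D regression y = a0 + a1 * year, then subtract the fitted trend
a1, a0 = np.polyfit(years, price, deg=1)
detrended = price - (a0 + a1 * years)

print(round(detrended.mean(), 6))   # least-squares residuals average to ~0
```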

Intro to Principal Component Analysis (PCA)

Why:

  1. We may have too many factors, difficult deciding which ones are important.
    • Ex. Endless number of features to choose from to predict stocks, how to reduce this dimension.
  2. There may be high correlation between some of the predictors. Leads to redundant predictors.

What:

  • PCA transforms data by:
    1. Removing correlations within the data.
    2. Ranking coordinates by importance.
  • Give you n principal components:
    • Concentrate on first n PCs.
    • Pro: Reduce effect of randomness.
    • Pro: Earlier PCs likely to have higher SNR (signal-to-noise ratio).
  • Plot of PCA (chart) in 2 PCs:
    • D_1 becomes new x-coordinate.
    • D_2 becomes new y-coordinate.
    • New coordinate's correlation becomes zero, or orthogonal.
    • PCA ranks PCs by amount of spread in value (e.g. D_1 has more spread than D_2, so it becomes PC_1)

img

How PCA Works (Math)

$X$: Initial matrix of data (scaled)

  • $i$: data point
  • $j$: factor
  • $X_{ij}$: $j$-th factor of data point $i$
  • Scaled such that $\frac{1}{m} \Sigma_i x_{ij} = \mu_j = 0$
    • For each factor $j$, the average value of $x_{ij}$ over all data points is shifted to be zero.

Find all of the eigenvectors of $X^TX$

  • $V$: Matrix of eigenvectors (sorted by eigenvalue)
  • $V = [V_1, V_2 \ldots ]$, where $V_j$ is the $j$-th eigenvector of $X^TX$

PCA:

  • Each principal component is a transformation of the scaled data matrix ($X$) and eigenvector matrix ($V$) (ie. $XV_n$)
  • $XV_1$ = first component, $XV_2$ = second component
  • Formula for $t_{ik}$, ie. the $k$-th new factor value for the $i$-th data point:
$$ t_{ik} = \Sigma^m_{j=1}x_{ij}v_{jk} $$

How to use beyond linear transformation

  • Use kernels for nonlinear functions.
  • Idea is similar to SVM modeling.

How to interpret the model? (Regression)

Problem: How do we get the regression coefficients for the original factors instead of PCs?

Example (Regression): PCA finds new $L$ factors ${t_{ik}}$, then regression finds coefficients $b_0, b_1, \ldots, b_L$.

$$ \begin{aligned} y_i &= b_0 + \Sigma^L_{k=1} b_k t_{ik} \\ &= b_0 + \Sigma^L_{k=1} b_k [ \Sigma^m_{j=1} x_{ij} v_{jk} ] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [\Sigma^L_{k=1} b_k v_{jk}] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [a_j] \end{aligned} $$

Implied regression coefficient for $x_j$: $$ a_j = \Sigma^L_{k=1} b_k v_{jk} $$

  • Each original attribute’s implied regression coefficient is equal to a linear combination of the principal components’ regression coefficients.
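A quick numpy check of this identity (random data, made-up PC coefficients $b_k$): predictions from the PC regression match predictions from the implied original-variable coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))            # stands in for the scaled data matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                                # columns are eigenvectors of X^T X

L = 2
T = X @ V[:, :L]                        # first L principal components t_ik

b = np.array([0.7, -0.3])               # made-up PC regression coefficients b_k
a = V[:, :L] @ b                        # implied coefficients a_j for original x_j

# Same predictions either way (ignoring the intercept b_0)
assert np.allclose(T @ b, X @ a)
print(a)
```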

Eigenvalues and Eigenvectors

Given $A$: Square matrix

  • If we can find a vector $v$ and constant $\lambda$ such that $Av = \lambda v$,
  • $v$ is an eigenvector of $A$
  • $\lambda$ is an eigenvalue of $A$
    • $\det(A - \lambda I) = 0$, meaning:
    • Every value of $\lambda$ for which the determinant of $A$ minus $\lambda$ times the identity matrix equals zero is an eigenvalue of $A$.
  • Opposite direction: Given $\lambda$, solve $Av = \lambda v$ to find the corresponding eigenvector $v$
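A small numpy sanity check of these definitions (the matrix and values here are illustrative):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # illustrative symmetric matrix (like X^T X)

eigvals, eigvecs = np.linalg.eigh(A)   # eigh: eigendecomposition for symmetric A
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)                        # A v = lambda v
    assert abs(np.linalg.det(A - lam * np.eye(2))) < 1e-9     # det(A - lambda I) = 0

print(eigvals)   # 1.0 and 3.0 for this matrix
```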

PCA: Pros and Cons

Points to consider:

  • Just because the first principal component has more variation doesn't mean it's always the most helpful for predictive modeling. This is because PCA depends only on the predictors, not on the response.
  • This could (on rare occasions) lead to cases where the response is mostly affected by variables with low variability, not by those with high variability.

Example: Classification

  • Left (Good): PCA gives PCs D1 and D2. D1 is helpful for differentiating between blue/red points. D2 essentially gives dividing line between the two colors.
  • Right (Bad): D1 doesn't help separate between red and blue points. D2 is actually the more helpful separator.

img

Takeaway:

  • Usually, dimensions that have more variability are better differentiators, specifically because they carry more information. But this is not always true.

Week 7 - Advanced Regression (Trees)

Types:

  • CART (Classification and Regression Trees)
  • Decision Tree

Regression Trees

Terminology:

  • branch: point where a splitting heuristic is applied.
  • node: a group of data points after the data is split.
  • internal nodes: non-terminal nodes.
  • leaf nodes: terminal nodes with no branches beyond them.

img

Method:

  1. Stratify feature space: Split data points by a certain predictor (e.g. age). This is greedy recursive binary splitting (refer to ISL Ch.8).
    • Typically use several predictors to determine split (e.g. age, number of kids, household income).
  2. Create a regression model at each split, using only the data points that belong in each node. This creates multiple regression models.
  3. Prediction: For each new data point pass the data through the decision tree and then use the leaf node's coefficient to return a response.

How regression trees are actually calculated

  • Problem: Running a full regression model on each node is expensive, plus each additional depth has a higher likelihood of overfitting (because there is less data per node).
  • Solution: Just calculate the constant term, by taking the average response in the node:
$$ a_0 = \frac{\Sigma_{i \in \text{node}} \; y_i}{\text{number of data points in node}} = \text{average response in node} $$

Derivation: img
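A tiny Python check that the node average is the SSE-minimizing constant (made-up leaf responses):

```python
import numpy as np

# Responses of the data points that landed in one leaf node (made-up values)
y = np.array([4.0, 7.0, 5.0, 8.0])
a0 = y.mean()                          # the leaf's constant prediction

def sse(c):
    """Sum of squared errors if the leaf predicted the constant c."""
    return float(((y - c) ** 2).sum())

# The node average beats any other constant guess
assert all(sse(a0) <= sse(c) for c in [0.0, 5.0, 6.5, 10.0])
print(a0)   # 6.0
```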

Other uses of trees The model used at each node would be different:

  1. Logistic regression models: Use fraction of node's data points with "true" response.
  2. Classification models: Use most common classification among node's data points.
  3. Decision models: At each branch determines whether or not we meet a threshold to make a decision (e.g. send marketing email).

Branching

2 things to consider:

  1. Which factor(s) should be in the branching decision?
  2. How should we split the data?

Branching Method

(Note, this is one method - there are other ways to do this)

Key Ideas:

  • Use a metric related to the model's quality.
  • Find the "best factor" to branch with.
  • Verify that the branch actually improves the model. If not, prune the branch back.

How to reject a potential branch

  • Low improvement benefit (based on threshold)
  • One side of the branch has too few data points now (e.g. each leaf must contain at least 5% of original data).

"Overfitting our model can be costly; make sure the benefit of each branch is greater than its cost"

img

Random Forests

Motivation

  • Introduce randomness
  • Generate many trees
  • Average of multiple trees results in better predictions.

Introducing randomness

We introduce randomness in 3 steps:

  • Bootstrapping: Sample N data points with replacement from the original data and build a tree on each sample.
  • Branching: Randomly choose a subset of factors, and choose the best factor from the subset at each branch.
    • A common number of factors to use is $1 + \log(n)$, where $n$ is the number of factors.
  • No pruning.

Effect:

  • Each tree in the forest has slightly different data.
  • We end up with a lot of different trees (usually 500-1000)
  • Each tree gives us a different regression model.

Results:

  1. For regression, use the average predicted response (average).
  2. For classification, use the most common response (mode).

Pros/Cons:

  • Pro: Average between trees can neutralize over-fitting. This gives better predictions.
  • Con: Harder to explain/interpret. Doesn't provide direct weights for each predictor. "Black-box" predictor.

Explainability / Interpretability

Why explainability matters:

  • Helps us understand the "why" of the model.
  • Helps stakeholders accept our ideas.
  • Sometimes a legal requirement.
  • However, less explainable models may give better results (can fit better to complex data).

Linear Regression Example

img

Takeaway: Linear regression models are pretty explainable. Each coefficient is tied to a specific predictor.

Regression Tree Example

img

Takeaway: Tree-based models are less explainable. Branching logic is often not intuitive. Random forest is even worse.

Logistic Regression

Definition: Turns regression model into a probability model.

Standard Linear Regression:

$$ y = a_0 + a_1x_1 + \ldots + a_jx_j $$

Logistic Regression: Takes linear function and puts it into an exponential.

  • $p$: probability of the event you want to observe
$$ \log{\frac{p}{1-p}} = a_0 + a_1x_1 + \ldots + a_jx_j $$$$ p = \frac{1}{1+e^{-(a_0 + a_1x_1 + \ldots + a_jx_j)}} $$

The linear function can return any number from $-\infty$ to $+\infty$:

  • As it goes to $-\infty$, $p \to 0$
  • As it goes to $+\infty$, $p \to 1$
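A minimal Python sketch of the logistic function (the intercept and coefficient here are made up, not fitted):

```python
import math

def logistic_p(x, a):
    """p = 1 / (1 + exp(-(a0 + a1*x1 + ...))) for coefficients a = [a0, a1, ...]."""
    z = a[0] + sum(ai * xi for ai, xi in zip(a[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

a = [0.0, 2.0]                 # made-up intercept and single coefficient
print(logistic_p([0.0], a))    # z = 0 gives p = 0.5
print(logistic_p([10.0], a))   # large positive z pushes p toward 1
print(logistic_p([-10.0], a))  # large negative z pushes p toward 0
```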

Logistic Regression Curve

Example 1:

img

  • Size of each dot shows how many observations there are.

Example 2:

img

  • If there are 0/1 responses for almost every predictor value, we can visualize it like above.
  • Each bar shows the fraction of responses that are one for each predictor value
  • Shows how LR curve fits the data.

Logistic Regression vs Linear Regression

Similarities

  • Transforms input data
  • Consider interaction terms
  • Variable selection
  • Single/random logistic regression trees exist.

Differences

  • Longer to compute
  • No closed-form solution
  • Understanding quality of model

Measures of model quality:

  • Linear regression used R-squared value (% of variance in the response that is explained by the model)
  • Logistic regression uses a "pseudo" R-squared value, because it's not really measuring % of variance.

Logistic Regression as Classification

Thresholding:

  • Answer "yes" if $p > N$, otherwise answer no.
  • Example: if $p \ge 0.7$, give loan.

Receiver Operating Characteristic (ROC) Curve

img

Area Under Curve (AUC) = Probability that the model estimates a random "yes" point higher than a random "no" point.

  • Higher means less random prediction (surrogate for quality).
  • Example: Loan payment
    • Joe: repaid loan, Moe: Didn't repay loan.
    • AUC explains probability that model gives Joe's data point a higher response value than Moe.
  • $AUC = 0.5$ means we are just guessing.
  • Also called the "concordance index".
  • Doesn't differentiate cost between False Negatives and False Positives.

Confusion Matrix

img

There are many metrics derived from confusion matrix, main ones to know are:

  • $Sensitivity = \frac{TP}{TP + FN}$
  • $Specificity = \frac{TN}{TN + FP}$
  • $Precision = \frac{TP}{TP + FP}$
  • $Recall = \frac{TP}{TP + FN}$ (same as sensitivity)
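These metrics follow directly from the four counts. A Python sketch, assuming (hypothetically) that the spam example's counts map as TP=490, FN=10, FP=100, TN=400:

```python
# Hypothetical mapping of the spam example's counts to the four cells
TP, FN, FP, TN = 490, 10, 100, 400

sensitivity = TP / (TP + FN)   # fraction of actual positives caught
specificity = TN / (TN + FP)   # fraction of actual negatives caught
precision   = TP / (TP + FP)   # fraction of predicted positives that are right
recall      = TP / (TP + FN)   # identical to sensitivity

print(sensitivity, specificity, precision)
```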

Evaluating a model's quality using Confusion Matrix

Example: Spam detection, confusion matrix

img

Idea: We can assign costs for each factor and calculate the sum.

$$ \text{Total Cost} = TP \cdot c_{TP} + FN \cdot c_{FN} + FP \cdot c_{FP} + TN \cdot c_{TN} $$

Example: Cost of lost productivity

  • \$0 for correct classifications.
  • \$0.04 to read spam
  • \$1 to miss a real message
  • Total cost: \$14
$$ \$14 = 490 * \$0 + 10 * \$1 + 100 * \$0.04 + 400 * \$0 $$
  • If ratio of spam changes, add a scaling factor into the equation (new value divided by old value):

img
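The same arithmetic as a short Python sketch (counts listed in the cost equation's order):

```python
# Counts in the order of the cost equation (TP, FN, FP, TN) with per-item costs
counts = [490, 10, 100, 400]
costs  = [0.00, 1.00, 0.04, 0.00]   # $1 to miss a real message, $0.04 to read spam

total = sum(n * c for n, c in zip(counts, costs))
print(total)   # 10 * $1 + 100 * $0.04 = $14
```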

Poisson regression

Used when we think the response follows a Poisson distribution.

$$ f(z) = \frac{\lambda^ze^{-\lambda}}{z!} $$
  • Example: count arrivals at an airport security line.
    • arrival rate might be a function of time.
    • We estimate $\lambda(x)$

Regression splines

  • Spline: Function of polynomials that connect to each other.
  • Knot: Point at which each function connects to another.
  • Fit different functions to different parts of the data set.

img

Other splines:

  • Order-k regression spline: polynomials are all order k
    • Example: multi-adaptive regression splines (MARS)
    • Called "Earth" in many stat libraries.

Bayesian Regression

We start with:

  • Data, plus
  • An estimate of how the regression coefficients and the random error are distributed.

  • Example: Predict how tall a child will be as an adult based on:

    • Data: heights of the child's mother and father
    • Expert opinion: starting distribution.
      • Example: coefficients uniformly distributed between 0.8 and 1.2
    • Use Bayes' theorem to update estimate based on existing data.
    • Most helpful when there is not much data. Use small existing data to temper expert opinion.
    • If no expert opinion, choose a broad prior distribution (e.g. uniform over a large interval).

KNN Regression

  • No estimate of prediction function.
  • Plot all the data.
  • Predict response, using average response of K closest data points.

--- END OF MIDTERM 1, START MIDTERM 2 MATERIAL --


Week 8 - Variable Selection

NOTE: These use regression, but can be applied to any model.

Motivation

  1. Overfitting: When # of factors >= # of data points, the model begins fitting to random effects.
  2. Simplicity: Simpler models are better than complex ones.
    • Less expensive
    • Less chance of insignificant factors being added (multiple factors, each with a low p-value, can still add up to a significant chance that some are spurious).
    • Easier to explain
    • Some factors are illegal to use; can't use factors highly correlated with forbidden ones either. Also hard to demonstrate that a complex model is innocuous.

Variable Selection Approaches

Two Schools: Classic and Optimization

Classical

What:

  • Decision made step by step.
  • Greedy Algorithm: At each step takes one thing that looks best without considering future options.

Types:

  1. Forward Selection
  2. Backward Elimination
  3. Stepwise Regression
Optimization

What: Applies optimization for variable selection.

Types:

  • LASSO
  • Elastic Net

(Classical) Forward Selection

Definition: Keep adding "good enough" factors until there's no more to add. Then remove those with high p-value.

img

  • Definitions of "good enough" and "high p-value" are subjective; typically 0.1-0.15 for "good enough", p-value > 0.05 for "high".

(Classical) Backward Elimination

Definition: Keep removing "bad" factors until there's no more to remove.

img

(Classical) Stepwise Regression (Forward + Reverse)

Definition: Combine forward selection / backward elimination.

img

(Optimization) LASSO

What: Adds constraint to standard regression equation (or other algos), then optimizes on it.

Math:

  1. Minimize standard regression equation: $$ \min{\Sigma^m_{i=1}} (y_i - (a_0 + a_1x_{i1} + \ldots + a_nx_{in}))^2 $$

  2. Add a "budget" $\tau$ to the minimization of the squared sum of errors. This "constrains" the sum of the coefficients' absolute values from becoming too large.

    • $a_j$: Regression coefficients
    • $\tau$: Constraint budget $$ \text{Subject To } \Sigma^n_{j=1} |a_j| \le \tau $$

Tips:

  • Make sure to scale your data (like SVM)
  • How to choose $\tau$ - factors to consider:
    • Number of variables
    • Quality of model
    • Choose the $\tau$ that gives the best tradeoff of the two.
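LASSO's selection effect can be illustrated without a full solver: for an orthonormal design (in the penalized form), the LASSO solution soft-thresholds the least-squares coefficients, driving small ones to exactly zero, while ridge only rescales them. A Python sketch with made-up coefficients:

```python
import numpy as np

def soft_threshold(b, lam):
    """LASSO solution for an orthonormal design: shrink and cut off at lam."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ridge_shrink(b, lam):
    """Ridge solution for an orthonormal design: rescale, never exactly zero."""
    return b / (1.0 + lam)

b = np.array([3.0, 0.4, -0.2, -2.5])   # made-up least-squares coefficients
print(soft_threshold(b, 0.5))   # small coefficients become exactly 0 (selection)
print(ridge_shrink(b, 0.5))     # everything shrinks, nothing is removed
```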

(Optimization) Elastic Net

What: Constrain combination of absolute value of coefficients and their squares.

  • Similar approach to choosing $T$, but also for $\lambda$.

Math (Minimization same as LASSO):

Add a constraint for the coefficients and their squares:

$$ \text{Subject To } \lambda\Sigma^n_{j=1} |a_j| + (1-\lambda)\Sigma^n_{j=1}a_j^2 \le \tau $$

(Aside) Ridge Regression

  • Same as Elastic Net, but removes the absolute value term in the constraint.
  • NOT used for variable selection!
  • Used for prediction.

Math (Minimization same as LASSO):

$$ \text{Subject To } \Sigma^n_{j=1} a_j^2 \le \tau $$

Lasso Regression (LR) vs Ridge Regression (RR)

  • LR/RR mathematically similar, but LR does variable selection and RR does not. Why?

img

  • When plotted in the 2D coefficient space, the LR constraint region is a diamond; anything within the yellow region satisfies the constraint. The same plot for RR shows a circle.

img

  • The difference is how those constraints interact with the same error function.
  • The error function (rearranged) is quadratic, shown visually as an ellipse. Depending on the choice of coefficients, this ellipse can increase/decrease in size.

img

  • For every ellipse, every point has same total error.
  • We are trying to find the smallest value of error function, while meeting the optimization constraint.
  • This is shown visually as where the ellipse and shaded area intersect.
  • As total error gets larger, ellipse gets closer to shaded area.
  • The point at which they touch is the smallest error we can find AND satisfies LR constraint.
  • If the error ellipse touches only one of the corners (e.g. on the $a_1$ axis), then $a_2$'s coefficient can be zero, hence we can remove that variable.
    • Note that the ellipse can also touch away from the corners, in which case we keep both variables.
    • The more variables we have and the more restrictive $\tau$ is, the more variables are likely to be removed.

img

  • Ridge regression has a circular shaded area, so the error function will most likely touch it where both $a_1$ and $a_2$ are non-zero.
  • Hence, we rarely see factors get a coefficient of zero (ie variable removal). img

Some more notes from ISLR

  • Lasso uses L1 regularization, while ridge regression uses L2.

The "Bias-Variance Tradeoff"

Bias:

  • High bias: Less fit to real patterns (bad)
  • Low variance: Less fit to random patterns (good)
  • Extreme case of high bias is using just the intercept: $$ y = a_0 $$

Variance:

  • Low bias: More fit to real patterns (good)
  • High variance: More fit to random patterns (bad)
  • Extreme case of high variance is a model complex enough to fit every data point, noise included.

The goal is to find the "sweet spot" where total error is at its minimum, and underfit/overfit errors are both minimized.

img

Ridge Regression

Motivation: RR reduces overfitting (variance) by scaling down the magnitude of each coefficient.

  • This is a type of regularization.
  • Coefficients ($a_1$) show magnitude of effect ($x_1$).

Application: Use when model is performing worse on validation data than training data.

  • We can't reduce the fit to random patterns alone, so we reduce the fit to all patterns.
  • By reducing coefficient magnitudes, the differences between the model's responses decrease, thus reducing overfitting.
  • Just be careful not to underfit.

Choosing a Variable Selection Model

Motivation: How to decide which VS model to choose?

Classic options (forward/backward/stepwise):

  • Good for initial analysis.
  • May tend to overfit, hence performance may suffer.

Optimization options (LASSO, Elastic net)

  • Slower but better prediction.
  • Elastic net Pros/Cons:
    • Pros: Has VS benefits of LASSO plus predictive benefits of RR.
    • Cons:
      • Arbitrarily removes multiple correlated predictors. It may end up keeping the more expensive predictor.
      • The RR part may reduce the magnitude of very relevant predictor's coefficient.
  • Best practice: No good rule of thumb, compare the difference between each! Human input always beneficial.

Week 9.1 Design of Experiments

Design of Experiments (DOE)

Motivation:

  • We can't always get data easily.
  • How to get representative sample (many dimensions, not so many data points)

Comparison and Control

Definition: When designing experiments, we need to control extraneous factors.

Blocking

Definition: Grouping by factors that create variation, so comparisons are made within each group.

Example: Red cars vs blue cars

  • Sports car vs family car can create variation within these subsets, so block on car type.
  • Reduces variance.

A/B Testing

Definition: Create control and treatment group(s), test which of two alternatives is better.

Three things that need to be true:

  1. Can collect data quickly
  2. Data must be representative
  3. Amount of data is small compared to the whole population

Factorial Design

Motivation: Understand importance of all factors between A/B test.

Definition: Test the effectiveness of every combination.

  • Example: 2 fonts × 2 wordings × 2 backgrounds = 8 combinations.
  • Run ANOVA (analysis of variance) to determine importance of each factor.
  • What if there are many combinations?
    • Ex: 7 factors, 3 choices each = $3^7$ = 2187 combos.
  • Solution: Fractional factorial design - test only a subset of those combinations.
    • Need to use a balanced set of choices.

img

Independent Factors: Test subset of combos, then use regression to estimate effects.

  • Each factor will have a categorical variable (background color [gold, blue, white] and font size [10, 12, 14]).
  • Use regression to estimate response (e.g. number of clicks).
  • Assumption is that each factor is independent.

Multi-armed Bandits

"Exploration vs Exploitation"

  • Need to balance benefit of more information (explore) vs getting immediate value (exploit).
  • Ex: 10 alternatives, 10K tests (1K on each alternative)
    • If first 1K test got maximum results, the rest of 9K nets negative value.

Definition:

  • Run multiple tests and assign more tests to those likely to succeed.
  • Benefit: Explore on the fly but also create more value along the way.

Method:

  1. Decide on K alternatives
  2. Start with NO information (equal probability of selecting each alternative)
  3. After N tests, update the probability of each one being the best alternative.
  4. Start assigning new tests according to those probabilities.
    • By giving more weight to the likely best alternatives, we are still exploring but making sure we are exploiting as well.
  5. Repeat until sufficiently sure of the best alternative.

Parameters:

  • Number of tests between recalculating probabilities.
  • How to update the probabilities (Bayesian, estimate on observed distributions)
  • How to pick an alternative to test based on probabilities.
    • e.g. look at the distribution of each alternative being best, or take into account the expected value of each.
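One common concrete choice for the update-and-assign loop is Beta-Bernoulli Thompson sampling. A Python sketch with two hypothetical alternatives (the lecture does not prescribe this exact rule; all numbers are illustrative):

```python
import numpy as np

# Beta-Bernoulli Thompson sampling: one way to update the probability of each
# alternative being best, and to assign tests according to those probabilities.
rng = np.random.default_rng(0)
true_p = [0.05, 0.95]                 # hidden success rates of K = 2 alternatives
wins = np.ones(2)                     # Beta(1, 1) prior: start with no information
losses = np.ones(2)

n_rounds = 1000
pulls = np.zeros(2, dtype=int)
for _ in range(n_rounds):
    samples = rng.beta(wins, losses)  # sample a plausible rate for each arm
    arm = int(np.argmax(samples))     # test the arm that looks best right now
    reward = rng.random() < true_p[arm]
    pulls[arm] += 1
    wins[arm] += reward
    losses[arm] += 1 - reward

print(pulls)   # most tests end up assigned to the better alternative
```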

Week 9.2 Probabilistic Models pt.1

Summary:

  • Sometimes a simple model is all you need. Collecting data for lots of factors can be expensive and less ROI.
  • Probability distributions can form the backbone of such simple models.

Bernoulli Distribution

Definition:

  • Coin flip (not always 50/50).
  • Useful when we do multiple "coin flips" across several factors.

Probability mass function (PMF)*

$$ P(X=1) = p \\ P(X=0) = 1-p $$

(PMF represents the probability associated with each possible value of the random variable)

Example: Charity donations

  • Charity asks for donation from 1/12 of their mailing list each month.
  • For each person:
    • $P(\text{sends donation}) = p$
    • $P(\text{does not send donation}) = 1-p$
  • Assumptions:
    • Donations are same size each month.
    • Each donor is independent (ie. bernoulli trial for each potential donor).
    • $p$ does not change month to month. Otherwise, the charity won't have consistent cash flow.
  • The number of donations each month is then binomially distributed.
    • Test: Observe how many successful trials each month, then use the binomial distribution to find the probability of getting at least that observed number of donations.
    • If probability is very low, it is likely the $p$ is different for that month.
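The test described above can be sketched in Python with made-up numbers (people asked in a month, a historical p, and an unusually low observed count; for a low month the relevant tail is "at most"):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical month: 120 people asked, historical p = 0.10, only 4 donated
n, p, observed = 120, 0.10, 4
p_at_most = sum(binom_pmf(x, n, p) for x in range(observed + 1))
print(p_at_most)   # tiny => p was probably lower than 0.10 that month
```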

Binomial Distribution

Definition: Probability of getting $x$ successes out of $n$ independent identically distributed Bernoulli ($p$) trials.

  • Responses must be i.i.d.!
  • ie. $p$ cannot differ between each time period.

Probability Mass Function: $$ P(X=x) = \binom{n}{x}p^x(1-p)^{n-x} = \frac{n!}{(n-x)!\,x!}p^x(1-p)^{n-x} $$

Graph:

  • With greater $n$ the binomial distribution converges to normal distribution. img

Geometric Distribution

Definition: Probability of having $x$ Bernoulli($p$) failures until first success.

  • (Or in reverse - Having $x$ Bernoulli(1-p) successes until first failure)
  • Think carefully before deciding $p$ and its opposite $1-p$!

Examples:

  • How many hits until a baseball bat breaks
  • How many good units manufactured before defective one.

Probability Mass Function

$$ P(X=x) = (1-p)^xp $$

Graph: img

Case Study: Assuming each Bernoulli trial is i.i.d., can we compare data to the geometric distribution to test whether the i.i.d. assumption holds?

  • Example 1: How many hits before a baseball bat breaks.
    • If it fits the geometric distribution, hits are i.i.d. Bernoulli trials.
    • If it doesn't fit, hits are not i.i.d. and need to consider other factors.
  • Example 2: Are security personnel more likely to pull someone aside if there hasn't been a search in a while?
    • Question is whether screeners are screening each passenger based on independent factors, or if they are selecting people based on whether or not someone was screened recently.
    • Left chart follows geometric distribution, ie. each person is screened independently.
    • Right chart does not, there are other factors at play (like the last time someone was screened).
  • Example 3: Successful sales calls, where the probability of having 5 successful calls before the first unsuccessful call is:
    • Probability: $p^5(1-p)$

img
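The sales-call probability follows directly from the PMF; a quick Python check with a hypothetical success rate:

```python
def geom_pmf(x, p):
    """P(X = x): probability of x Bernoulli(p) failures before the first success."""
    return (1 - p)**x * p

p_success = 0.8                      # hypothetical chance a sales call succeeds
# 5 successful calls, then the first unsuccessful one: treat "unsuccessful"
# as the Bernoulli event of interest, with probability 1 - p_success
prob = geom_pmf(5, 1 - p_success)
print(prob)                          # equals p_success^5 * (1 - p_success)
```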

Poisson / Exponential / Weibull

Poisson Distribution

  • Good at modeling random arrivals
  • $\lambda$: average number of arrivals/time period
  • arrivals are i.i.d.

Probability Mass Function: $$ f_x(x)=\frac{\lambda^x e^{-\lambda}}{x!} $$

Chart: img

Exponential Distribution

Relation to Poisson:

  • If arrival are Poisson($\lambda$), then time between successive arrivals is exponential($\lambda$) distribution.

Probability Density Function $$ f_x(x) = \lambda e^{-\lambda x} $$

Chart: img

Case study of relation between Poisson and exponential (security line delays at airport):

  • The experiment collected inter-arrival times.
  • Data showed inter-arrival times were exponentially distributed.
  • This means the number of arrivals per unit time is Poisson.
  • Note: This works vice-versa as well!

Weibull Distribution

Definition: The length of time until a failure (geometric models the number of discrete tries until a failure; Weibull models continuous time).

  • Lightbulb example:
    • Geometric: How many light-switch flips on/off before the bulb fails?
    • Weibull: Leave the bulb on; how long until the bulb fails?

Probability Density Function:

  • $\lambda$: scale param
  • $k$: shape param $$ f_x(x) = \frac{k}{\lambda}(\frac{x}{\lambda})^{k-1} e^{-(\frac{x}{\lambda})^k} $$

Intuition of $k$:

  • $k < 1$: Models failure rate decreasing with time (e.g. defects)
  • $k > 1$: Models failure rate increasing with time (e.g. tires)
  • $k = 1$: Models failure rate constant with time.

Relation between Weibull and exponential:

  • When $k=1$, Weibull = exponential.
  • Just replace $\lambda$ with $\frac{1}{\lambda}$ to get the exponential rate.
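A quick Python check of the $k=1$ case (the values plugged in are arbitrary):

```python
import math

def weibull_pdf(x, lam, k):
    """Weibull density with scale lam and shape k."""
    return (k / lam) * (x / lam)**(k - 1) * math.exp(-((x / lam)**k))

def exponential_pdf(x, rate):
    """Exponential density with the given rate."""
    return rate * math.exp(-rate * x)

lam = 2.0
for x in [0.1, 1.0, 5.0]:
    assert abs(weibull_pdf(x, lam, k=1.0) - exponential_pdf(x, rate=1.0 / lam)) < 1e-12

print("Weibull with k = 1 and scale lam matches Exponential with rate 1/lam")
```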

Using software to fit data

  • Input: data
  • Output: Fit to varying distributions / params
  • Only use software output as guidance
    • Example: software return Weibull, k=1.002
    • This is basically exponential (since Weibull k=1 is same as exponential)
    • It may just return Weibull bc data has some noise in it.

Q-Q Plots

Motivation: Even if 2 sets of data have different scales or different numbers of data points, 2 similar distributions will have about the same value at each quantile.

  • Example: If 2 sets of students share similar distribution, the height at each percentile should be about the same.

Use:

  • Understand whether 2 sets of data share similar distribution
  • Understand whether 1 set of data fits 1 specified distribution.

Note: Statistical tests can do the same, but sometimes visual representation is easier to understand.

Good fit vs Bad fit: img

  • Good match (left): both distributions fit into 45 degree line of percentiles
  • Bad match (right): Dataset 2 has higher values in the lower percentiles (lower tail) as well as the higher percentiles (higher tail) compared to dataset 1.

Misleading distribution caught by QQplot: img

  • Statistical tests may erroneously say these are good match.
  • Visually, we can see that the dataset 2 has higher values in lower tail and lower values in upper tail.
  • If we care about the extreme ends, the 2 datasets are not a good match.

Testing single dataset

  • X-axis: Data
  • Y-axis: Theoretical values of percentiles
  • Note: This is not always the case; some software plots them the other way
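A minimal numpy sketch of the quantile comparison behind a Q-Q plot (hypothetical height-like data; in practice a plotting routine such as statsmodels' `qqplot` draws the picture):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two samples of different sizes drawn from the same distribution:
# their quantiles should line up near the 45-degree line.
a = rng.normal(loc=170, scale=8, size=500)  # e.g. heights, group 1
b = rng.normal(loc=170, scale=8, size=200)  # e.g. heights, group 2

qs = np.linspace(0.05, 0.95, 19)
qa = np.quantile(a, qs)
qb = np.quantile(b, qs)

# Plotting qa against qb gives the Q-Q plot; here we just check
# that matching quantiles are close to each other.
print(np.round(qa - qb, 2))
```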

Week 9.2 - Probabilistic Models pt.2

Case Study: Queuing

img img img

A more complex example: Adding 2 employees to the call center and only hold 4 customers in the queue.

  • Can still give a closed-form answer due to memoryless property

Memoryless Property

Definition: Doesn't matter what has happened in the past, what matters is where we are now.

Memoryless Exponential distribution: Distribution of remaining time = initial distribution of time.

  • In exponential distribution, it doesn't matter how long an employee has been on the phone. The distribution of the remaining call time is the same.
    • Note: This is purely a property of the exponential distribution (not something we need to know beforehand about phone calls).
    • If phone call time fits this distribution, it is memoryless.
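The memoryless property can be checked numerically; a sketch with simulated call lengths (the 5-minute mean is a made-up value):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated call lengths, exponential with mean 5 minutes (hypothetical).
calls = rng.exponential(scale=5.0, size=200_000)

# Memorylessness: the distribution of remaining time, given the call
# has already lasted 3 minutes, matches the original distribution.
overall = calls.mean()
remaining = (calls[calls > 3.0] - 3.0).mean()
print(round(overall, 2), round(remaining, 2))  # both close to 5
```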

Memoryless Poisson distribution: Distribution of time to next arrival = initial distribution of time to next arrival.

  • Poisson is related to exponential distribution.
  • Poisson interarrival times are exponentially distributed.

Works the other way too:

  • If data fits exponential distribution -> memoryless
  • If we know data is not memoryless -> not exponential

Law firm example: Is it memoryless?

  • Should tire manufacturer pay damages for an accident that happened when tire with 10K miles failed?
  • First need to know if tires generally fail at 10K miles, to understand whether this is a defect or normal tire behavior.
  • Can answer by modeling tire failure:
    • P(tire fail at 10K miles) = ?
    • Assumption: Tires more likely to fail, the more worn out they are.
    • Thus, not memoryless
    • Thus, can't model with exponential distribution
    • Instead we could try Weibull with $k>1$ (which models failure over time)

Queuing example: memoryless?

Potential model params:

  • General arrival distribution $A$
  • General service distribution $S$
  • Number of servers $c$
  • Size of queue $K$
  • Population size $N$
  • Queueing discipline $D$
  • Kendall notation: Standard notation to describe a queueing system.

Other factors that Kendall notation doesn't include:

  • Customers hanging up the phone in frustration
  • Customers "balking" at the queue

Note that Simulation can be used instead for cases where the mathematical model gets too complex (too many factors).

Simulation

Definition: Build a model of something to study and analyze its behavior.

Examples:

  • Manufacturing process
  • Airport lines
  • Freight train dispatching

Types of Simulations

Deterministic: No randomness; same inputs give same outputs.

Stochastic: Use when the system has randomness. Focus of this lesson.

Continuous-time: Changes happen continuously.

  • Example: Chemical processes, propagation of diseases
  • Often modeled with differential equations

Discrete-event: Changes happen at discrete time points.

  • Example: Call center simulations (someone calls, worker finishes talking to customer)

Discrete event stochastic simulations

  • Valuable when systems have high variability
  • Using average values isn't good enough.

Simulation Software

Components:

  • Entities: Things that move through the simulation (e.g. bags, people)
  • Modules: Parts of the process (e.g. queues, storage)
  • Actions
  • Resources (e.g. workers)
  • Decision points
  • Statistical tracking

Under the hood: Uses (pseudo) random number generation, tracking, animation, etc

Replications: number of runs of simulation

  • 1 replication = one data point (unrepresentative)
  • Run multiple times to get distribution of outcomes
  • Example: Simulating average daily throughput over # of replications (x=throughput, y=replications)
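A toy sketch of replications: a single-server day with made-up exponential service times, where each replication yields one throughput observation:

```python
import numpy as np

rng = np.random.default_rng(42)

def one_replication(rng, hours=8.0, mean_service=0.1):
    # One simulated day: count jobs finished before time runs out.
    t, done = 0.0, 0
    while True:
        t += rng.exponential(mean_service)
        if t > hours:
            return done
        done += 1

# One replication = one noisy data point; many replications give a
# distribution of daily throughput.
reps = np.array([one_replication(rng) for _ in range(500)])
print(round(reps.mean(), 1), round(reps.std(), 1))
```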

img

Validation: Use real data to validate your simulation is giving reasonable results.

  • Make sure to check both average AND variability.
  • Real & simulated avg don't match -> problem
  • Real & simulated variance don't match -> problem

Simulation for Prescriptive Analytics

When to use simulation: Asking "what-if" questions

  • Ex: What is change in throughput with $100K investment in faster machines?
  • Ex: Value of hiring an extra call center worker?
  • Ex: Best distribution of baggage tugs to have at the airport (one for each/every-other gate, or floating pool)?
  • Compare simulated options to determine best course of action.
  • Simulation software may use built-in heuristic optimization to automatically optimize (e.g. best buffer size at each step of manufacturing process)

Simulation Comparisons

Steps:

  1. Design different simulations
  2. Run simulations
  3. Compare metrics between simulations. Important to decide what metrics to use here.
    • Ex: Baggage tugs - Success metric is fraction of bags that get to baggage claim in 20min or less.

Considerations:

  • One set of "random" observations could be better than another.
    • Solution: Use the same random seeds across the simulations being compared, so replication 1 of every option uses random stream 1, replication 2 uses stream 2, and so on.
  • Model is only as good as quality of input, missing/incorrect data leads to incorrect answer.
    • Ex: Call center simulation - Wrong assumption that all workers answer calls equally quickly. Leads to costly decisions.

How to Validate Simulations

Problem: Simulations can be validated by comparing it to data that already exists. However, experiments often simulate something that doesn't exist in reality. How to validate?

Example: Simulating spread of 2 interacting diseases

Problem 1: Matching disease prevalence of two diseases after simulation period did not work, because:

  • Multiple models matched the outcome after N simulation days, but
  • These models had very different results after N+1 simulation days.
  • If we can't tell which model is right, we can't tell which is a good public health policy.

Solution 1: Use multiple measures of disease spread (not just one)

  • Still use overall prevalence, but add others (distribution of disease cluster size, largest cluster size, etc)
  • A good model must match all metrics to show validation.

Problem 2: None of the metrics matched, even though medical parameters followed the literature.

  • Why? Initial simulation tracked disease prevalence across the entire population, but the literature typically collected data first from volunteers who had one of the diseases and then did contact tracing.
  • Literature likely overestimated metrics based on their data collection method. There are likely more isolated examples in reality.

Solution 2: Run 2 simulations

  1. Simulate disease propagation
  2. Given simulated disease propagation, simulate data collection.

Markov Chain Model

Definition: Models transition probabilities between states of a memoryless system.

How it works

For each state $i$ in the model,

  • $p_{ij}$: Transition probability from state $i$ to state $j$
  • $P=[p_{ij}]$: Transition matrix (e.g. weather states)

img

  • Given the transition matrix, we can calculate the probability of moving through multiple states.
    • Ex: Given $\pi=(.5,.25,.25)$ as probabilities of sunny, cloudy, rainy today, we get $\pi P=(.525,.25,.225)$ for sunny, cloudy, rainy tomorrow. Can keep multiplying for subsequent days ($\pi P^x$).

Steady state

Definition: Long-term probability that the system will be in each state.

  • Apply $P$ and get initial vector back: $$ \pi^*P=\pi^* $$
  • Solve for such a $\pi^*$ vector $$ \pi^*P = \pi^* \ \text{and} \ \Sigma_i\pi_i^*=1 $$
  • Not all Markov chains have a steady state.
    • Can't have cyclical behavior
    • Every state must be reachable from all others.
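Solving for the steady state numerically; a sketch with a hypothetical 3-state weather transition matrix (not the one from the slides):

```python
import numpy as np

# Hypothetical transition matrix: rows = sunny/cloudy/rainy today,
# columns = tomorrow; each row sums to 1.
P = np.array([[0.6, 0.3, 0.1],
              [0.4, 0.3, 0.3],
              [0.2, 0.5, 0.3]])

# Solve pi* P = pi* together with sum(pi*) = 1: stack the linear
# system (P^T - I) pi = 0 with the normalization row.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi_star, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.round(pi_star, 4))
print(np.allclose(pi_star @ P, pi_star))  # unchanged by another step
```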

Key Assumption

Markov chains are memoryless. Biggest limitation.

  • Transitions only depend on the most recent state.
  • Most systems do not exhibit this property.
  • More useful when connecting small amounts of information to find larger ones. Example: Web search: States = web pages
    • For web pages $i$, $j$: $p_{ij} = p$ if page $i$ links to page $j$, $0$ if not.
    • Markov chain = jumping randomly from page to page
    • Use the steady-state probabilities to rank web pages (Google Page Rank).

Other Uses

Can be used to model things that are not memoryless in the short run, but where that doesn't matter for the long-run system (ex. urban sprawl, population dynamics, disease propagation)

Week 10.1 - Missing Data

Some values are more likely to be missing than others; this leads to bias in the missing data.

Examples:

  • Date of death amongst cancer patients: patients who are still alive will have the field missing. Naive modeling may give an overly pessimistic prediction of life span.
  • Reported income, higher earners typically share less data. Naive model may predict low/medium earners well but not for those with high income.

Dealing with Missing Data

Throw away data points

  • Pro: Easy to implement. Won't introduce imputation errors.
  • Con: Lose data. Potential for censored or biased missing data.

Use categorical variables to indicate missing data.

With categorical data, we can simply add an extra "MISSING" category to the attribute.

Trickier with continuous data, some options:

  • If you fill missing data with 0, it might pull coefficients one way or another to account for the missing data.
  • Solution: Interaction terms
    • Creates two variables: One when data is not missing, and one when data is missing.
    • If you create interaction terms against all variables, you're essentially creating two separate models (like a tree split).

img

Imputation

Definition: Estimate missing values

Midrange Value: Choose mean, median (numeric) or mode (categorical)

  • Pros: Hedge against being too wrong. Easy to compute.
  • Cons: Biased imputation (ex. income level: higher earners are less likely to report, so averaging the reported values underestimates the true average).
  • Takeaway: More accurate on average, less accurate variability.

Regression: Reduce or eliminate the problem of bias.

  • Pros: Usually gives better value than filling average.
  • Cons: Complex - build, fit, validate, test to estimate missing value. Does not capture all the variability.

Perturbation: Add a perturbation to each imputed value.

  • Ex. Normally-distributed variation.
  • Less accurate on average, but more accurate variability.
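A sketch contrasting plain mean imputation with perturbed imputation, on synthetic income-like data (the 10% missing-at-random rate is made up):

```python
import numpy as np

rng = np.random.default_rng(7)
income = rng.lognormal(mean=10.5, sigma=0.6, size=1000)
missing = rng.random(1000) < 0.1  # hypothetical 10% missing at random
observed = income[~missing]

# Mean imputation: accurate on average, but all imputed values are
# identical, so variability is understated.
mean_filled = np.full(missing.sum(), observed.mean())

# Perturbed imputation: add normally distributed noise so imputed
# values also reflect the data's spread.
perturbed = mean_filled + rng.normal(0.0, observed.std(), missing.sum())

print(round(mean_filled.std(), 1), round(perturbed.std(), 1))
```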

Imputation approaches:

  • Limit the amount of imputation (no more than 5% per factor)
  • Con: Additional errors from imputation? ie, imputation error + perturbation error + model error.
    • Don't worry, regular data also probably has errors.

Week 10.2 Optimization

  • Optimization is closer to prescriptive analytics, sitting above descriptive and predictive analytics. Can guide leadership's decision making.
  • Building optimization models is hard - software automates the solution but the model build is up to you.

Elements of Optimization Models

Components:

  1. Variables
  2. Constraints
  3. Objective function

1) Variables

Definition: Decisions that the optimization model will find the best values for.

Example: Political candidate scheduling

  • $x_i$: Total time spent in state $i$
  • $y_i$: Number of visits to state $i$
  • $z_i$: 1 if state $i$ is ever visited, 0 if not.
  • $w_{id}$: time spent in state $i$ on day $d$
  • $v_{id}$: 1 if state $i$ visited on day $d$, 0 if not.

2) Constraints

Definition: Restrictions on the decisions that can be made.

  • $\Sigma_i x_i \leq 30$: Only 30 days allotted for campaign.
  • $\Sigma_{d=24}^{30} v_{\text{Florida},d} \geq 3$: Must visit Florida at least 3 times between days 24-30.
  • $\Sigma_d v_{id} = y_i$: Total visits must add up correctly
  • $\Sigma_i (\alpha p_i \sqrt{x_i + \frac{1}{3}\Sigma_{j \in N(i)} x_j} + \beta v_{id}f_{d})$: expected new votes (this expression belongs to the objective function, below)

3) Objective function

Definition: A measure of the quality of a solution, ie. a function of the variables that we're trying to maximize or minimize.

  • Example: Maximize expected new votes based on campaign schedule.
  • The components of the function need to be determined by other analytics models (regression, etc). Oftentimes optimization models require outputs from other models as input.

Solution - Output of Optimization Model

Feasible Solution: Variable values that satisfy all constraints.

Optimal solution: Feasible solution with the best objective value.

Examples of Simple Optimization

Example 1 - Diet Problem (US Army)

Problem: Satisfy soldiers' nutritional requirements at minimum cost.

Definitions:

  • $n$: number of foods
  • $m$: number of nutrients
  • $a_{ij}$: amount of nutrient $j$ per unit of food $i$
  • $m_j$: minimum daily intake of nutrient $j$
  • $M_j$: maximum daily intake of nutrient $j$
  • $c_i$: per-unit cost of food $i$
  • $a_{ij}x_i$: Amount of nutrient $j$ obtained from food $i$

Optimization model components:

  1. Variables
    • $x_i$: amount of food $i$ in daily diet.
  2. Constraints
    • $\Sigma_i a_{ij}x_i \geq m_j$ for each nutrient $j$
    • $\Sigma_i a_{ij}x_i \leq M_j$ for each nutrient $j$
    • $x_i \geq 0$ for each food $i$
  3. Objective function
    • $\text{Minimize } \Sigma_i c_i x_i$
  4. Other complexities
    • Variety in diets
    • Seasonal cost variation
    • Taste of food
    • Taste of combinations of foods.
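A tiny instance of the diet model (all numbers made up: 3 foods, 2 nutrients), solved as a linear program with scipy:

```python
import numpy as np
from scipy.optimize import linprog

# a[i, j] = amount of nutrient j per unit of food i (made-up values).
a = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [4.0, 0.5]])
m = np.array([10.0, 8.0])      # minimum daily intake of each nutrient
M = np.array([40.0, 30.0])     # maximum daily intake of each nutrient
c = np.array([1.5, 1.0, 2.0])  # per-unit cost of each food

# linprog minimizes c @ x subject to A_ub @ x <= b_ub and x >= 0;
# the "at least m_j" constraints become -(a^T) x <= -m.
A_ub = np.vstack([-a.T, a.T])
b_ub = np.concatenate([-m, M])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

print(res.status, np.round(res.x, 3), round(res.fun, 3))
```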

Example 2 - Call center scheduling

Problem: Meet forecasted demand $d_i$ for each day of the week $i$

  • Workers work 5 days in a row, then 2 days off
  • Minimize worker-days used

Wrong approach:

  1. Variables
    • $x_i$: number of ppl working on day $i$
  2. Constraints
    • Meet demand: $x_i \geq d_i$ for all days $i$
    • Integer workers: $x_i$ is integer for all days $i$ (can't hire fraction of a worker)
  3. Objective function
    • $\text{Minimize } x_\text{Sunday} + x_\text{Monday} + \ldots + x_\text{Saturday}$

Right approach: Accounts for consecutive worker days. This way, optimization already accounts for union rules.

  1. Variables
    • $x_i$: number of ppl who start working on day $i$
  2. Constraints
    • Meet demand: $\Sigma_\text{j working on day i} x_j \geq d_i$
      • Ex: $x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tue} \geq d_\text{Tue}$
    • Integer workers: $x_i$ is a non-negative integer for all days $i$ (can't hire a fraction of a worker)
  3. Objective function
    • $\text{Minimize } 5(x_\text{Sun} + x_\text{Mon} + \ldots + x_\text{Sat})$
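The right approach can be sketched as an integer program (hypothetical demands; the `integrality` argument of `scipy.optimize.linprog` needs SciPy >= 1.9):

```python
import numpy as np
from scipy.optimize import linprog

d = np.array([12, 18, 20, 20, 19, 22, 10])  # made-up demand, Sun..Sat

# cover[i, j] = 1 if someone starting a 5-day shift on day j is
# working on day i (shifts wrap around the week).
cover = np.zeros((7, 7))
for j in range(7):
    for k in range(5):
        cover[(j + k) % 7, j] = 1.0

# Minimize total worker-days 5 * sum(x), meeting demand every day,
# with each x_j a non-negative integer.
res = linprog(c=5.0 * np.ones(7),
              A_ub=-cover, b_ub=-d.astype(float),
              bounds=(0, None),
              integrality=np.ones(7))

x = np.round(res.x).astype(int)
print(x, "total worker-days:", 5 * x.sum())
```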

Modeling Binary Variables

Idea: Binary variables allow more complex models.

Integer variables in optimization

  • Fixed charges in objective function
  • Constraints to choose among options
  • Constraints requiring same/opposite decisions
  • If-then constraints

Linear Regression vs Optimization

  • LR just needs the data.
  • Optimization requires user to design a model (variables, constraints, obj function).

Example: Stock Market Investment

Slide 1:

  • Add a fixed charge for each transaction: add the total fixed cost $\Sigma_i t_i y_i$ (with binary variables $y_i$) and subtract it from the existing objective function.
  • Add a linking constraint to link $x_i$ and $y_i$.

img

Slide 2:

  • Add minimum amount for each stock: $x_i \geq m_i y_i$ for all stocks $i$.
  • Add various investment strategies for specific stocks. These are all reflected as different constraints.

img

Slide 3:

  • If-then constraint: If investing in energy stock, invest in at least 5.

img img

Discovering the Real Questions

  • Oftentimes, we don't get clear paragraph that contains variables, constraints and objective function.
  • Discovery is hard:
    • Too many constraints to remember or say all at once.
    • Easy to forget obscure constraints.
    • Easy to omit "obvious" (to experts) issues.
    • Example: 5 day work week means 5 consecutive days, no partial shifts.
  • Repeat until done
    • Get information
    • Build model
    • Check with stakeholders (do solutions make sense)
    • If yes, done; if no, repeat.

Violating the Constraints:

  • You might find out that the company is violating constraints.
  • Project might shift to complying with laws.

Quiz

Let $y_i$ and $y_j$ both be binary variables, and let $x_i$ be a continuous variable. Which expression is equivalent to the constraint "If $y_i=1$, then $y_j$ must also be 1"?

A: $y_j \geq y_i$

Week 11 - Optimization Continued

Optimization in Statistics

Linear Regression

img

Stats and Optimization notations differ in what they consider variables and constant coefficients:

img

  • $x_{ij}$ is the data
  • $a_j$ are the coefficients of the model

| Statistics POV | Optimization POV |
| --- | --- |
| $x_{ij}$ are variables | $x_{ij}$ are constant coefficients |
| $a_j$ are constant coefficients | $a_j$ are variables |

Other Regression Models

Lasso / ridge regression / elastic net have constraints, as opposed to linear regression.

img

Logistic Regression

img

SVM

Hard classification:

  • Constraints: Each data point must be correctly classified.
  • Obj Func: Maximize margin between support vectors.

Soft classification:

  • Constraints: No constraints
  • Obj Func: Absorb maximizing margin plus classification errors.

img

Time Series Models

Exponential smoothing model

  • Obj Func: Minimize prediction error

img

ARIMA + GARCH img

K-means for Clustering

img

Note: Some models are mathematically easier to solve than others. This will be covered in a later lecture.

Using Optimization to Customize ML Models

We can use optimization to customize ML models through:

  • Adding custom constraints
  • Setting constraints to include specific features

Note: Lecture applies this to LR, but technique can be applied to other models.

Example: Linear Regression

Customization 1: Adjusting the exponent

  • LR defines best as minimizing sum of squared residuals.
  • This means $a$ coefficients are influenced more by points farther from the line (larger residuals).
  • What if we don't want that? If so, we can adjust the exponent from two to whatever number is appropriate (e.g. 3/2).
  • Then use optimization software to fit best line.
  • Note: Setting the exponent to 1 (ie. using absolute values of residuals) is called minimizing the sum of absolute errors. Many regression packages already include this.

img

Customization 2: Setting intercept/coefficient constraints

Intercept:

  • Intercept $a_0$ might always need to be zero (e.g. acceleration of vehicle from zero to full power).
  • This can be set as a constraint ($a_0 = 0$) in the regression and use optimization to solve it.

img

Coefficient:

  • Example: $a_1$ must be at least as big as $a_2$, $a_2$ must be at least as big as $a_3$, etc, and the differences between consecutive coefficients should be decreasing.
  • This kind of order constraint on regression coefficients is often used in chemical engineering / oil drilling / professional sports.

img

Customization 3: Constraints on acceptable values of coefficients (feature selection)

  • Lasso can do this by adjusting $\tau$ (higher = more features, lower = fewer features), but we can also restrict the number directly.
  • We can add this constraint by adding a new optimization variable $w_i$ for each feature.
  • $w_i$ is binary: 1 if feature is selected, 0 if not.
  • Constraints:
    1. $\Sigma_{i=1}^n w_i = F$, ie. select exactly $F$ features
    2. A feature can't have a nonzero coefficient unless it is selected: $L_i w_i \leq a_i \leq U_i w_i$ ($U_i$ and $L_i$ are upper / lower bounds; can be set arbitrarily at first and optimized)

img

Classification of Optimization Models

General Optimization Model

  1. $x$ = vector of variables
  2. $\text{Minimize } f(x)$ or $\text{Maximize } f(x)$
  3. Variables $x$ must belong to set $X$ (ie. $x \in X$)

Linear Program

img

Convex Quadratic Programs

  • $f(x)$ is a convex quadratic function
  • $\text{Minimize } f(x) \text{ or Maximize } -f(x)$
  • Constraint set $X$ is defined by linear equations and inequalities.
  • Easy/fast to solve, but not as quickly as linear programs.

Convex Optimization Programs

  • Objective function $f(x)$ is:
    • Concave: if maximizing
    • Convex: if minimizing
  • Constraint set $X$ is a convex set
  • Easy to solve, but solutions can take a lot longer than for linear or quadratic programs.

Integer Program

  • Linear program, plus:
  • Some (or all) variables restricted to take only integer values (can be binary)
  • More difficult to solve even with good software packages.

General Non-Convex Program

  • Optimization problem is not convex
  • Hard to find optimal solutions

Order of Optimization Programs in Difficulty

  1. Linear programs (and, even easier: network models)
  2. Convex quadratic programs
  3. Convex programs
  4. Integer programs
  5. General non-convex programs

What if the problem is too hard? Use a heuristic.

  • A rule-of-thumb process that usually gives good (though not guaranteed optimal) solutions.

Network Program

  • Fastest program
  • Operates on a network with nodes and arcs
  • Uses:
    • mapping (ie. shortest path problem),
    • assignment model (e.g. which worker gets which job to maximize efficiency)
    • max flow model (e.g. how much oil can flow thru complex network)

img

Stochastic Optimization (or optimization under uncertainty)

Previously: Parameters were assumed known; forecasts gave exact expected values.

  • Optimization models assume that we know all the values of the input data exactly.
  • But what if data/param isn't known exactly?
  • What if forecast vals aren't known exactly?
  • Sometimes we need to model this uncertainty.

Method 1: Modeling Conservatively

Example: Call center worker assignment

Original constraint: Total # of workers has to be at least enough to meet expected demand.

  • $x_i$: # of workers starting 5-day shift on day $i$
  • $d_i$: expected demand on day $i$
$$x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue}$$

New constraint: Add a buffer $\theta$, ie. extra workers (just in case):

$$x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue} + \theta$$

Alternative constraint: Given a known probability distribution, add a chance constraint (ensuring that the probability of meeting a certain constraint is above a certain level). The probability of having enough workers to meet demand must be at least some value $p$.

Method 2: Scenario Modeling

Definition: Define a set of scenarios and optimize over all of them.

Example: Modeling bugs after rollout

Scenario modeling:

  1. 2 small recurring bugs
  2. One major bug 3 days after launch
  3. 2 major, immediate bugs
  4. Catastrophic scenario after 10K signups

Variables:

  • $x_i$: Number of workers starting 5-day shift on day $i$
  • $d_{is}$: Expected demand on day $i$ in scenario $s$

Robust solution: Model that satisfies all scenario demands:

  • Con: This can be expensive and unrealistic $$ x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue,1} \\ x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue,2} \\ x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue,3} \\ x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue,4} $$

Optimize expected cost:

  • In addition to the cost of paying workers, we could also estimate the cost of having too few workers to meet demand and put that into the objective function.
  • Complex: Need to estimate the probability of each scenario occurring and weigh the missed demand costs by each scenario's probability in the objective function.
  • Model could still get pretty large.

img

Method 3: Dynamic Programs

  • Different structure from mathematical programming models (variables, constraints, obj func).
  • Dynamic program structure:
    1. States (exact situations + values)
      • e.g. State 1: 1 wk into rollout, major bug detected and fixed on first day.
    2. Decisions (choices of next state)
      • e.g. How many ppl to tell that tomorrow we want them to start their 5-day shift.
    3. Bellman's equation: Determine optimal decisions
  • Assumes there's no uncertainty given a state and a decision

Stochastic Dynamic Program:

  • Dynamic program, but decisions have probabilities of next state.
  • We know the probabilities of going from one state to another.

Markov Decision Process

  • Stochastic dynamic program with discrete states and decisions.
  • Probabilities depend only on current state/decision.

Summary: Two general methods

  1. Mathematical programming (those we've seen so far)
  2. Dynamic programming

Basics of Optimization Algorithms (or how to solve the models)

_Note: Lecture describes how most optimization algorithms work at a high level._

Method to solve optimization algorithm

Step 1: Initialization

  • Find an initial solution by picking values for all of the variables.
  • May lead to a very bad solution, variable values may not satisfy all constraints, etc.
  • Idea is to just get an initial solution.

Step 2: Improving from Initial Variables

Repeat the following:

  1. Find an improving direction $t$.
    • Starting with initial solution, find a vector of relative changes to make to each variable. This vector is $t$.
  2. Use a step size $\theta$ to move along it.
    • Vector $t$ is the relative change in each variable, and $\theta$ is the amount that we change.
  3. New solution = old solution + $\theta t$.

img

Continue this process until:

  • Convergence: Solution doesn't change much
  • Time/resource allotted runs out.

Algorithm from Calculus

This optimization algorithm is very similar to Newton's method for finding the roots of a function $f(x)$.

Newton's method:

  • At each step $n$ we have a current solution $x_n$.
  • We find an improving direction $-\frac{f(x_n)}{f'(x_n)}$ and move one unit (step size $1$) in that direction.
    • $f'$ is the derivative of $f$
  • The new solution $x_{n+1}$ is the old solution plus the step size times the improving direction:
$$ x_{n+1} = x_n - 1 \cdot \frac{f(x_n)}{f'(x_n)} $$

Where:

  • Current solution $x_n$ at step $n$
  • $1$: Step size
  • $-\frac{f(x_n)}{f'(x_n)}$: Improving direction
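Newton's method in a few lines, finding the root of a made-up example $f(x)=x^2-2$:

```python
def newton(f, fprime, x0, iters=20):
    # Each step moves one unit along the improving direction
    # -f(x_n)/f'(x_n): x_{n+1} = x_n - f(x_n)/f'(x_n).
    x = x0
    for _ in range(iters):
        x = x - f(x) / fprime(x)
    return x

# Root of f(x) = x^2 - 2 is sqrt(2) ~ 1.4142.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```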

img

Guarantees - Convex vs Non-Convex Problems

Convex optimization problems: Guaranteed to find optimal solution

img

Non-convex optimization problem: Not guaranteed to find optimal solution.

  • May converge to infeasible solution, or
  • Converge to local optimum.

img

Runtime - Integer vs Linear

  • Integer: Can be long. Often tree-based but rely on direction/step-size methods as sub-problems.
  • Linear: Often fast (including network algos)

Week 11.2 Advanced Models

Non-parametric Statistical Tests

Motivation: Use when the distribution of responses is unknown. Most hypothesis tests assume that we know the underlying distribution that we're testing against.

Test 1: McNemar's Test

Use: Compare results on pairs of responses (Drug A vs B's effect on disease).

Method:

  1. Run paired experiment,
  2. Throw out all cases where results are the same. Given what's left over, test whether we'd expect results this extreme or more extreme just by luck.
  3. Determine probability by p-value (the smaller the p-value, the less likely the result is due to chance)
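The exact version of this test is a binomial test on the discordant pairs; a sketch with made-up counts, using scipy's `binomtest`:

```python
from scipy.stats import binomtest

# Made-up paired results: among pairs with different outcomes,
# Drug A worked (and B didn't) 21 times; B worked (and A didn't) 9.
n_a, n_b = 21, 9

# Under H0 the discordant pairs split 50/50, so the exact McNemar
# test is a two-sided binomial test on that split.
p = binomtest(n_a, n_a + n_b, p=0.5).pvalue
print(round(p, 4))  # small p -> the drugs likely differ
```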

img

Test 2: Wilcoxon Signed Rank Test for Medians

Assumption: Response distribution is continuous and symmetric.

Use: Checks whether median of distribution is different from a specific value $m$.

Method: Given responses $y_1, \ldots, y_n$

  1. Find abs val of each response's distance from $m$
    • $|y_1-m|,\dots,|y_n-m|$
  2. Rank them from smallest to largest.
  3. Calculate the sum of all ranks ($W$) where $y_i > m$.
    • $W = \Sigma_{y_i \gt m}\text{rank}(|y_i-m|)$
  4. Run p-value test for $W$
    • If $p$ is small, median is probably different from $m$
    • p-value calculated via software
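In practice this is one call to `scipy.stats.wilcoxon`; a sketch with synthetic data whose true median differs from the hypothesized $m$:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(3)
m = 50.0
y = rng.normal(loc=53.0, scale=4.0, size=100)  # true median ~53

# Test H0: median == m by running the signed-rank test on y - m.
stat, p = wilcoxon(y - m)
print(stat, round(p, 6))  # small p -> median likely differs from m
```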

Alternative test for paired samples:

  • Given pairs $(y_1, z_1), \ldots, (y_n, z_n)$ from observations $y$ and $z$.
  • Hyp: Do 2 sets of paired samples have the same median?
  • Use $|y_1-z_1|,\dots,|y_n-z_n|$ for rank test.

Use of Wilcoxon vs McNemar:

  1. Wilcoxon: Numeric data
  2. McNemar: y/n data

Test 3: Mann-Whitney Test

Use: Compare 2 data sets that are not paired

Method: Given independent observations $y_1, \ldots, y_n$ and $z_1, \ldots, z_m$

  1. Rank all observations together ($y$ and $z$)
  2. $U$ = smaller of two adjusted rank sums:
$$ U = \min\{U_y, U_z\} \\ U_y = \Sigma_{i=1}^n \text{rank}(y_i) - \frac{n(n+1)}{2} \\ U_z = \Sigma_{j=1}^m \text{rank}(z_j) - \frac{m(m+1)}{2} $$
  3. Find the significance of $U$ (need software or a table)

Matching Statistical Tests to Situations

Types of Tests

  • Parametric tests
    • Use exact values of the data, e.g. the mean
    • Ex. Student's t-test
  • Non-parametric tests
    • Use ranks of data, e.g. the median
    • E.g. Mann-Whitney, Wilcoxon tests
  • Binomial-based tests
    • Use successes/failures (e.g. fraction of successes)
    • E.g. McNemar's test

Type of data involved

  1. One-sample tests
  2. Two-sample tests (unpaired)
  3. Two-sample tests (paired)

Determining Paired vs Unpaired

Example: Plane tickets from GA to NY

  • Paired:
    • Data 1: Flight costs every Monday of year X
    • Data 2: Flight costs every Thursday of year X
    • Hyp: Prices are higher on Mondays
  • Unpaired: Comparing flight costs
    • Data 1: Flight costs from first 30 Mondays of year X
    • Data 2: Flight costs from last 30 Thursdays of year X img

Determining Type of Test

  1. Parametric to answer Qs about the mean
  2. Non-parametric to answer Qs about ranking (median)
  3. Binomial to answer Qs about binary outcomes

Which type of test depends on the question: img

Bayesian Modeling

Use: Bayes models are helpful when there is a lack of data.

  • Even with a single observation, combining it with a broader set of observations allows us to make a deduction/prediction.
  • OTOH Bayes prediction also gives counterintuitive results which can be difficult to explain.

Bayes' theorem (ie. conditional probability)

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

Example: Medical test for a disease

  • TP: 98%
  • FP: 8%
  • 1% of population really has disease, ie. 8.9% of ppl test positive
  • Q: If someone tests positive, what is the probability they have the disease?

Variables:

  • $A$: Has the disease
  • $B$: Tested positive
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \\ = \frac{0.98 * 0.01}{0.089} \\ = \text{11%} $$

Results: After testing positive, a person has only an 11% chance of having the disease.

  • Why? So many more people don't have the disease, ie. many more false positives than true positives.
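The arithmetic behind this result:

```python
# Numbers from the example: TP rate 98%, FP rate 8%, 1% prevalence.
p_pos_given_disease = 0.98
p_disease = 0.01
# P(test positive) = TP rate * 1% + FP rate * 99% = 0.089 (8.9%).
p_pos = 0.98 * 0.01 + 0.08 * 0.99

# Bayes' theorem: P(disease | positive).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 2))  # 0.11
```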

Empirical Bayes Modeling

Use: Overall distribution of something is known/estimated, but only a little data is available.

Example: Predicting outcomes of NCAA basketball tournament

  • Based on outcome of regular season games, predict how much better one team is than another.
  • Data: Difference $X$ in points scored by the home team and road team
    • $X$ is approximately normal: $X \sim N(m+h, \sigma^2)$
    • $h$: Home court advantage
    • $m$: True difference in teams' strength (unknown)
    • $\sigma^2$: Variance
  • Suppose GT beat NC at home by $x$ points - Want to figure out how much of the point delta is due to diff in team strength.
    • Observed diff is $x$, but it alone doesn't account for home advantage or random factors.

Method:

  1. Look for $P$ of having true diff $M$ given observation $X$.
  2. $P(M|X)$ is the $P$ of observing a point difference of $X$ given a true margin $M$, times the $P$ that true margin is $M$, divided by $P$ that the observed margin is $X$.
  3. True diff $M$ is also normally distributed, with its own mean and variance ($\tau^2$)
  4. Get $\sigma$, $\tau$, $h$ from past data
  5. Integrate distribution from zero to infinity to show the $P$ that GT is the better team.

img

Applied Example: Home team won by 20 points

  • $h$ (home court advantage) = 4 pts
  • $\tau$ (stdev in team strength diff) = 6 pts
  • $\sigma$ (stderr from random variation) = 11 pts
  • Of 20 pt victory,
    • 4pts due to home court $h$
    • About 12.5 pts due to random variation
    • Only 3.5 pts due to difference between teams.
    • So the empirical Bayes model seems to predict that the 20 point winner is slightly more likely to lose if the two teams played again on the other home court.

Intuition:

  • The variance of the difference between team strengths is about 36.
  • The variance due to randomness is about 121 ($11^2$).
  • So there's a lot more variance due to randomness.
  • The Why: A 20 pt win is more likely to happen because of randomness than because of team strength diffs, since the amount of variance due to randomness is much higher.
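The shrinkage computation for this example can be sketched directly (this is the standard posterior-mean formula for a normal prior and normal likelihood; the results land near the rounded figures above):

```python
# Empirical Bayes shrinkage for the worked example: observed margin x = 20,
# home advantage h = 4, prior M ~ N(0, tau^2) with tau = 6, noise sd sigma = 11.
# The posterior mean of M shrinks the adjusted margin (x - h) by the
# factor tau^2 / (tau^2 + sigma^2).
import math

x, h, tau, sigma = 20.0, 4.0, 6.0, 11.0
shrink = tau**2 / (tau**2 + sigma**2)
post_mean = shrink * (x - h)              # estimated true strength difference
post_sd = math.sqrt(tau**2 * sigma**2 / (tau**2 + sigma**2))

# P(M > 0): probability the winner really is the better team
p_better = 0.5 * (1 + math.erf(post_mean / (post_sd * math.sqrt(2))))
print(round(post_mean, 1), round(p_better, 2))  # → 3.7 0.76
```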

Communities in Graphs

Use: Automate finding highly-interconnected subpopulations.

img

Louvain algorithm

Definition: Algorithm to decompose a graph into communities by maximizing the modularity of a graph.

  • Louvain is a heuristic. Not guaranteed to find absolute best solution.

Notation:

  • $ij$: arc between nodes $i$ and $j$
  • $a_{ij}$: weight of arc (importance)
    • If there is no arc between $i$ and $j$, then $a_{ij}$ is zero.
  • $W_i$: weight of each node equal to the total weight of all the arcs connected to node $i$.
  • $W$: total weight of all of the arcs
  • Modularity: Measure of how well the graph separates into communities/modules that are densely connected internally, but only sparsely connected to each other.

Modularity is defined as: $$ \text{Modularity} = \frac{1}{2W}\sum_{i,j \text{ in same community}} \left(a_{ij} - \frac{W_i W_j}{2W}\right) $$

Algorithm:

  1. Initialization: Each node is its own community
  2. For each node $i$, check how much modularity would go up if $i$ moved to another community.
    • $i$ moves to whichever community yields the greatest modularity increase
    • If no move increases modularity, $i$ stays in its current community.
  3. Repeat until there is no way to improve modularity by moving a node. img

Then shift to macro-mode: each community becomes a supernode.

  1. Each supernode's weight is the total weight of all nodes inside it
  2. Superarcs are created between supernodes, with weights equal to the total weight of all arcs between the two communities.
    • There's also a superarc from each supernode to itself, with weight equal to the total weight of all arcs inside the supernode
  3. Repeat the node-moving steps above, but on the supernodes/superarcs img
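The modularity formula itself is easy to sketch on a toy graph (the graph and community split below are illustrative, not from the course):

```python
# Modularity per the formula above:
# Q = (1/2W) * sum over i,j in the same community of (a_ij - W_i*W_j / 2W)

def modularity(adj, communities):
    """adj: symmetric dict-of-dicts {i: {j: weight}}; communities: list of sets."""
    W = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2  # total arc weight
    node_w = {i: sum(nbrs.values()) for i, nbrs in adj.items()}     # W_i per node
    comm_of = {i: c for c, nodes in enumerate(communities) for i in nodes}
    Q = 0.0
    for i in adj:
        for j in adj:
            if comm_of[i] == comm_of[j]:
                Q += adj[i].get(j, 0.0) - node_w[i] * node_w[j] / (2 * W)
    return Q / (2 * W)

# Two triangles joined by a single arc; grouping each triangle as its own
# community gives modularity 5/14 ≈ 0.357.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {i: {} for i in range(6)}
for u, v in edges:
    adj[u][v] = adj[v][u] = 1.0
print(round(modularity(adj, [{0, 1, 2}, {3, 4, 5}]), 4))  # → 0.3571
```

Putting all six nodes in one community gives modularity 0, so the two-triangle split is the better decomposition, as expected.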

Neural Networks and Deep Learning

Use: React to complex patterns that can't be captured by rules (NLP, speech recognition, CV).

Neural Networks

img

Deep Learning

Definition: Neural networks with $n\gt1$ hidden layers.
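A minimal forward-pass sketch of a one-hidden-layer network (the weights are arbitrary illustrative numbers; in practice they are learned, e.g. by backpropagation):

```python
# One-hidden-layer neural network forward pass with sigmoid activations.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]                        # hidden layer
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)  # output layer

# 2 inputs -> 3 hidden units -> 1 output (illustrative weights)
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [0.7, -0.5, 0.2]
b2 = 0.05
print(0 < forward([1.0, 2.0], W1, b1, W2, b2) < 1)  # → True (sigmoid output is in (0, 1))
```

Stacking more than one hidden layer of this kind is what the notes call deep learning.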

Competitive Models

Use: When the system being modeled reacts intelligently to your decisions. Use analytics to consider all sides of the system.

Examples: img

  1. A company might use past purchasing data to set the price for a new product it's launching, based on competitors' prices for similar products. But once the company sets its price, competitors might adjust their own prices, so the company's sales could differ from what the first model anticipated.
  2. Governments set corporate tax policies, and companies react by choosing various ways to store and spend their money. To figure out how much tax revenue the government gets, we need to consider both sides of the decision.
  3. Companies set incentive policies for their employees, and employees then act in the way that's best for themselves.
  4. In the United States, the government sometimes auctions off bandwidth for communications. Companies bidding in these auctions want to pay as little as possible for what they want, but they need to outbid everyone else. Each company is simultaneously thinking the same way, so to find its best bid price, each needs to consider not just its own situation, but what everyone else will do too.
  5. Supply chain where multiple companies are all working toward the same final goal while still negotiating prices and payments throughout the supply chain. How should the payments be structured so that every company has the right incentives?

Game Theory & Cooperative Game Theory

Definition: Game theory is the study of mathematical models of strategic interactions among rational agents.

Example: BP vs Shell Gas Prices
  • 2 gas stations
  • Set price: \$2.50 or \$2.00
    • Same price results in 50/50 demand
    • Otherwise all demand goes to lower priced station.
  • $d$: Demand
  • What's the best price to choose?

Game 1: Cost = $1.00/gallon

  • Both gas stations gain the most by setting the price lower.
  • Model predicts a Stable Equilibrium: Both stations will lower price, and neither has incentive to change.
    • Even though outcome is worse than if they both agreed to charge higher price, due to prisoner's dilemma (promise then cheat) both sides will stay at the lower price.
    • Math assumes that companies wouldn't have a cartel.
    • Stable equilibrium doesn't always mean each one's outcome is worse than if they colluded.

img
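The Game 1 equilibrium can be verified with a small best-response search. The demand $d = 1000$ gallons is an assumed illustrative value; only the relative payoffs matter:

```python
# Game 1 sketch (cost $1.00/gallon, prices $2.50 or $2.00, assumed demand
# d = 1000): profit = share of demand * (price - cost).
d, cost = 1000, 1.00
prices = [2.50, 2.00]

def profits(p1, p2):
    if p1 == p2:                       # same price: split demand 50/50
        return (d / 2 * (p1 - cost), d / 2 * (p2 - cost))
    return (d * (p1 - cost), 0.0) if p1 < p2 else (0.0, d * (p2 - cost))

# A price pair is a (pure-strategy) stable equilibrium if neither station
# can raise its own profit by unilaterally switching price.
def is_nash(p1, p2):
    u1, u2 = profits(p1, p2)
    return (all(profits(q, p2)[0] <= u1 for q in prices)
            and all(profits(p1, q)[1] <= u2 for q in prices))

eq = [(p1, p2) for p1 in prices for p2 in prices if is_nash(p1, p2)]
print(eq)  # → [(2.0, 2.0)]: both undercut, even though (2.5, 2.5) pays each more
```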

Game 2: Smaller profit margin - Cost = $1.75

  • Both would benefit by setting their prices higher.

img

Game 3: Continuous price range

How analysis works: Given BP choosing a price $p_{BP}$,

  • If Shell chooses a higher price $p_{SHELL} \gt p_{BP}$, Shell's profit is zero.
  • If $p_{SHELL} \lt p_{BP}$, Shell's profit is $d(p_{SHELL}-Cost)$
  • If $p_{SHELL} = p_{BP}$, they split the demand and each station's profit is $\frac{d}{2}(p_{SHELL}-Cost)$
  • Both companies will keep lowering their prices until they reach a stable equilibrium.
  • This is part of the reason that competition tends to drive down prices for consumers or to motivate companies to differentiate their products to give consumers more choices about quality or features.

img

Timing

Simultaneous decision-making: All parties must make a decision at the same time.

  • E.g. Sealed-bid auction

Sequential decision-making: Parties decide in turn; each can see the earlier parties' decisions before making its own.

Strategy Types

  • Pure Strategy: Just do one thing (gas station example)
  • Randomized Strategy: Randomize your decision making because competitors will deduce your pattern if using pure strategy (e.g. rock paper scissors)
  • Perfect Information: All parties know everyone else's situation (e.g. chess)
  • Imperfect Information: Information is asymmetric among parties (e.g. not knowing profit margins of competitors)
  • Zero-sum Game: One party's gain is the other's loss; always a winner and a loser
  • Non Zero-sum Game: Possible for all parties to benefit/lose depending on decisions made (e.g. economics).

How to solve competitive situations? Use different optimization models.

Natural Language Processing

tl;dr - Context is difficult.

Survival Models

Definition: Given predictor variables, predict probability of an event happening (or not) before a certain time.

Use:

  • Insurance: Given person's predictors X, predict probability of lifespan, car crash, theft claim, etc.
  • Medicine: Given patient's predictors X, predict probability patient will live N years after operation.
  • Machinery/sensors: Given climate/location, how often a machine will need to undergo maintenance.
    • Relation to CUSUM: CUSUM detects when a machine needs immediate maintenance. Survival model can reduce number of times the CUSUM models show the need for immediate maintenance by recommending how often preventative maintenance should occur.

Cox proportional hazard model

Definition: Use an exponential function to estimate the survival probability (like logistic regression).

img
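For reference, the standard Cox proportional hazards form, where $h_0(t)$ is the baseline hazard and $x_1, \dots, x_p$ are the predictors:

```latex
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)
```

The exponential term scales each individual's hazard up or down relative to the baseline, which is the sense in which it resembles logistic regression.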

Challenges: Censored Data

Definition: A condition in which the value of a measurement or observation is only partially known

  • No data before/after a specific time
  • No data after a certain number of observed occurrences

Example: Patient $i$ received transplant 12 yrs ago, still alive.

  • What's $y_i$ (length of time patient survived post-transplant)? We don't know yet.

Gradient Boosting

Definition: Using additional models to augment your factor-based model.

Basic idea: Start with 1 model, then repeatedly fit additional models to gradient information and add them to the original.

  • Used with factor-based models (regression, classification, trees)

Method

  1. Base model $F(x)$ will result in errors $y-F(x)$
    • Can be just a constant.
  2. We use another model to "counteract" the original errors:
    • $H(x)$ fits to original model's errors $y-F(x)$
    • $H(x)$ response: Negative gradient of quality metric evaluated at F's predictions $-\nabla L(F(x))$
      • For LR where quality metric is squared error, the residuals are essentially the negative of the gradient for $F$'s predictions.
      • Other models (like tree) may use a different quality metric, but idea is the same.
  3. To get next function/model, we move the model in that gradient fit direction $H$. We do that by adding original model to augmented model.
  4. New model is also multiplied with step size $\gamma$ (can be the amount that minimizes training set prediction error, or a regularization value)
    • This allows each model to move in the right step incrementally.
  5. Keep augmenting new models ($H_k$) until convergence
  6. Final model is original model plus all augmented models.
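The method above can be sketched for squared-error loss, where the negative gradient is just the residuals. The data, step size, and round count below are illustrative, and the $H_k$ here are depth-1 regression stumps rather than full trees:

```python
# Gradient-boosting sketch for squared-error loss: start from a constant
# model F0, then repeatedly fit a stump H_k to the current residuals
# (the negative gradient) and add gamma * H_k to the model.

def fit_stump(x, r):
    """Best single-split regression stump minimizing squared error on residuals r."""
    best = None
    for s in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, lm, rm)
    _, s, lm, rm = best
    return lambda xi, s=s, lm=lm, rm=rm: lm if xi <= s else rm

def boost(x, y, n_rounds=20, gamma=0.5):
    f0 = sum(y) / len(y)                             # F0: constant model
    stumps, pred = [], [f0] * len(x)
    for _ in range(n_rounds):
        r = [yi - pi for yi, pi in zip(y, pred)]     # residuals = -gradient
        h = fit_stump(x, r)
        stumps.append(h)
        pred = [pi + gamma * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + sum(gamma * h(xi) for h in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.2, 1.0, 1.1, 3.9, 4.2, 4.0]
model = boost(x, y)
sse = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y))
print(round(sse, 3))  # small training SSE (vs ~12.97 for the constant F0 alone)
```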

img

Note: This idea is very similar to optimization

Supplemental Notes from ISLR

  • Boosting is alternative to bagging (ie. random forest for trees)
  • Trees are built sequentially, unlike bagging which is in parallel.
  • Does not do bootstrap sampling like RF.

Gradient Boosting Example

img

  • Example: Regression tree, 2 predictors
  • $L$: Loss function, sum of squared error to measure model quality. We want to minimize this.
    • $F_0$ nets $L=2$

Wiki definition of loss function:

In mathematical optimization and decision theory, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function.

img

  • New table, response is negative gradient
  • $H_0$ fits to negative gradient
  • New model is sum of both $F_0(x)$ and $H_0(x)$
  • To calculate $F_1(x)$, need to run data through both $F_0$ and $H_0$ trees.
  • $F_0$ and $H_0$ summed per data point to get final predictions.
    • Once we have these 2 scalars, we don't need to calculate each time, just sum them up.

Optimizing Weights for Boosted Models

Instead of guessing, we can optimize the weight of each boosted model, ie. $\gamma$ (see image).

If we write out L explicitly, we can optimize by taking its first derivative, plugging in the values of $y$ and $F_0$ and $H_0$ for each point, setting it equal to zero and solving for $\gamma_0$.
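For squared-error loss this works out in closed form (standard calculus, matching the step just described):

```latex
L(\gamma_0) = \sum_i \big(y_i - F_0(x_i) - \gamma_0 H_0(x_i)\big)^2, \qquad
\frac{dL}{d\gamma_0} = -2\sum_i H_0(x_i)\big(y_i - F_0(x_i) - \gamma_0 H_0(x_i)\big) = 0
```

Solving the second equation for $\gamma_0$ gives $\gamma_0 = \dfrac{\sum_i H_0(x_i)\big(y_i - F_0(x_i)\big)}{\sum_i H_0(x_i)^2}$.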

img

Optimized $\gamma$ is 0.56, which yields $L=0.04$. Much lower loss than just using $F_0$.

img
